This is gprofng.info, produced by makeinfo version 6.8 from gprofng.texi.

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* gprofng: (gprofng).           The next generation profiling tool for Linux
END-INFO-DIR-ENTRY


File: gprofng.info, Node: Top, Next: Introduction, Up: (dir)

GNU Gprofng
***********

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

* Menu:

* Introduction::            About this manual.
* Overview::                A brief overview of gprofng.
* A Mini Tutorial::         A short tutorial covering the key features.
* Terminology::             Various concepts and some terminology explained.
* Other Document Formats::  How to create this document in other formats.
* Index::                   The index.

 -- The Detailed Node Listing --

Introduction

Overview

* Main Features::                     A high level overview.
* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::  How to create a profile.

A Mini Tutorial

* Getting Started::                   The basics of profiling with gprofng.
* Support for Multithreading::        Commands specific to multithreaded applications.
* Viewing Multiple Experiments::      Analyze multiple experiments.
* Profile Hardware Event Counters::   How to use hardware event counters.
* Java Profiling::                    How to profile a Java application.

Terminology

* The Program Counter::                What is a Program Counter?
* Inclusive and Exclusive Metrics::    An explanation of inclusive and exclusive metrics.
* Metric Definitions::                 Definitions associated with metrics.
* The Viewmode::                       Select the way call stacks are presented.
* The Selection List::                 How to define a selection.
* Load Objects and Functions::         The components in an application.
* The Concept of a CPU in gprofng::    The definition of a CPU.
* Hardware Event Counters Explained::  What are event counters?
* apath::                              Our generic definition of a path.


File: gprofng.info, Node: Introduction, Next: Overview, Prev: Top, Up: Top

1 Introduction
**************

The gprofng tool is the next generation profiler for Linux.  It consists
of various commands to generate and display profile information.

   This manual starts with a tutorial on how to create and interpret a
profile.  This part is highly practical and aims to get users up to
speed as quickly as possible; we would like to show you how to get your
first profile on your screen as soon as possible.

   This is followed by more examples, covering many of the features.  At
the end of this tutorial, you should feel confident enough to tackle the
more complex tasks.

   In a future update a more formal reference manual will be included as
well.  Since even in this tutorial we use certain terminology, we have
included a chapter with descriptions at the end.  In case you encounter
unfamiliar wording or terminology, please check this chapter.

   One word of caution.  In several cases we had to tweak the screen
output somewhat in order to make it fit.  This is why the output may
look slightly different when you try things yourself.

   For now, we wish you a smooth profiling experience with gprofng and
good luck tackling performance bottlenecks.


File: gprofng.info, Node: Overview, Next: A Mini Tutorial, Prev: Introduction, Up: Top

2 A Brief Overview of gprofng
*****************************

* Menu:

* Main Features::                     A high level overview.
* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::  How to create a profile.

Before we cover this tool in quite some detail, we start with a brief
overview of what it is, and the main features.
Since we know that many of you would like to get started right away, we
already explain the basics of profiling with 'gprofng' in this first
chapter.


File: gprofng.info, Node: Main Features, Next: Sampling versus Tracing, Up: Overview

2.1 Main Features
=================

These are the main features of the gprofng tool:

   * Profiling is supported for an application written in C, C++, Java,
     or Scala.

   * Shared libraries are supported.  The information is presented at
     the instruction level.

   * The following multithreading programming models are supported:
     Pthreads, OpenMP, and Java threads.

   * This tool works with unmodified production level executables.
     There is no need to recompile the code, but if the '-g' option has
     been used when building the application, source line level
     information is available.

   * The focus is on support for code generated with the 'gcc' compiler,
     but there is some limited support for the 'icc' compiler as well.
     Future improvements and enhancements will focus on 'gcc' though.

   * Processors from Intel, AMD, and Arm are supported, but the level of
     support depends on the architectural details.  In particular,
     hardware event counters may not be supported.

   * Several views into the data are supported.  For example, there is a
     function overview showing where the time is spent, but source line,
     disassembly, call tree, and caller-callees overviews are available
     as well.

   * Through filters, the user can zoom in on an area of interest.

   * Two or more profiles can be aggregated, or used in a comparison.
     This comparison can be obtained at the function, source line, and
     disassembly level.

   * Through a scripting language, and customization of the metrics
     shown, the generation and creation of a profile can be fully
     automated and provide tailored output.


File: gprofng.info, Node: Sampling versus Tracing, Next: Steps Needed to Create a Profile, Prev: Main Features, Up: Overview

2.2 Sampling versus Tracing
===========================

A key difference with some other profiling tools is that the main data
collection command 'gprofng collect app' mostly uses Program Counter
(PC) sampling under the hood.

   With _sampling_, the executable is stopped at regular intervals.
Each time it is halted, key information is gathered and stored.
This includes the Program Counter that keeps track of where the
execution is.  Hence the name.

   Together with operational data, this information is stored in the
experiment directory and can be viewed in the second phase.

   For example, the PC information is used to derive where the program
was when it was halted.  Since the sampling interval is known, it is
relatively easy to derive how much time was spent in the various parts
of the program.

   The opposite technique is generally referred to as _tracing_.  With
tracing, the target is instrumented with specific calls that collect the
requested information.

   These are some of the pros and cons of PC sampling versus tracing:

   * Since there is no need to recompile, existing executables can be
     used and the profile measures the behaviour of exactly the same
     executable that is used in production runs.

     With tracing, one inherently profiles a different executable,
     because the calls to the instrumentation library may affect the
     compiler optimizations and run time behaviour.

   * With sampling, there are very few restrictions on what can be
     profiled and even without access to the source code, a basic
     profile can be made.

   * A downside of sampling is that, depending on the sampling
     frequency, small functions may be missed or not captured
     accurately.  Although this is rare, it may happen and is the
     reason why the user has control over the sampling rate.

   * While tracing produces precise information, sampling is statistical
     in nature.  As a result, small variations may occur across
     seemingly identical runs.  We have not observed more than a few
     percent deviation though, especially if the target job executed
     for a sufficiently long time.

   * With sampling, it is not possible to get an accurate count of how
     often functions are called.


File: gprofng.info, Node: Steps Needed to Create a Profile, Prev: Sampling versus Tracing, Up: Overview

2.3 Steps Needed to Create a Profile
====================================

Creating a profile takes two steps.  First the profile data needs to be
generated.  This is followed by a viewing step to create a report from
the information that has been gathered.

   Every gprofng command starts with 'gprofng', the name of the driver.
This is followed by a keyword to define the high level functionality.
Depending on this keyword, a third qualifier may be needed to further
narrow down the request.  This combination is then followed by options
that are specific to the functionality desired.

   The command to gather, or "collect", the performance data is called
'gprofng collect app'.  Aside from numerous options, this command takes
the name of the target executable as an input parameter.

   Upon completion of the run, the performance data can be found in the
newly created experiment directory.

   Unless explicitly specified otherwise, a default name for this
directory is chosen.  The name is 'test.<n>.er' where 'n' is the first
integer number not yet in use for such a name.

   For example, the first time 'gprofng collect app' is invoked, an
experiment directory with the name 'test.1.er' is created.

   Upon a subsequent invocation of 'gprofng collect app' in the same
directory, an experiment directory with the name 'test.2.er' will be
created, and so forth.

   Note that 'gprofng collect app' supports an option to explicitly name
the experiment directory.  Other than the restriction that the name of
this directory has to end with '.er', any valid directory name can be
used for this.

   Now that we have the performance data, the next step is to display
it.

   The most commonly used command to view the performance information is
'gprofng display text'.  This is a very extensive and customizable tool
that produces the information in ASCII format.

   Another option is to use 'gprofng display html'.  This tool generates
a directory with files in html format.  These can be viewed in a
browser, allowing for easy navigation through the profile data.
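
   To make the two-step workflow concrete, below is a minimal sketch of
a complete profiling session.  It assumes an executable called 'a.out'
in the current directory and relies on the default experiment name; the
program, its options, and the directory name will of course differ for
your application:

     $ gprofng collect app ./a.out
     $ gprofng display text -functions test.1.er

   The first command runs the program under control of the tool and
stores the performance data in 'test.1.er'.  The second command prints a
function level overview based on that data.  Both commands are covered
in detail in the next chapter.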

File: gprofng.info, Node: A Mini Tutorial, Next: Terminology, Prev: Overview, Up: Top

3 A Mini Tutorial
*****************

In this chapter we present and discuss the main functionality of
'gprofng'.  This will be a practical approach, using an example code to
generate profile data and show how to get various performance reports.

* Menu:

* Getting Started::                   The basics of profiling with gprofng.
* Support for Multithreading::        Commands specific to multithreaded applications.
* Viewing Multiple Experiments::      Analyze multiple experiments.
* Profile Hardware Event Counters::   How to use hardware event counters.
* Java Profiling::                    How to profile a Java application.

File: gprofng.info, Node: Getting Started, Next: Support for Multithreading, Up: A Mini Tutorial

3.1 Getting Started
===================

The information presented here provides a good and common basis for many
profiling tasks, but there are more features that you may want to
leverage.  These are covered in subsequent sections in this chapter.

* Menu:

* The Example Program::                        A description of the example program used.
* A First Profile::                            How to get the first profile.
* The Source Code View::                       Display the metrics in the source code.
* The Disassembly View::                       Display the metrics at the instruction level.
* Display and Define the Metrics::             An example of how to customize the metrics.
* A First Customization of the Output::        An example of how to customize the output.
* Name the Experiment Directory::              Change the name of the experiment directory.
* Control the Number of Lines in the Output::  Change the number of lines in the tables.
* Sorting the Performance Data::               How to set the metric to sort by.
* Scripting::                                  Use a script to execute the commands.
* A More Elaborate Example::                   An example of customization.
* The Call Tree::                              Display the dynamic call tree.
* More Information on the Experiment::         How to get additional statistics.
* Control the Sampling Frequency::             How to control the sampling granularity.
* Information on Load Objects::                How to get more information on load objects.


File: gprofng.info, Node: The Example Program, Next: A First Profile, Up: Getting Started

3.1.1 The Example Program
-------------------------

Throughout this guide we use the same example C code that implements the
multiplication of a vector of length n by an m by n matrix.  The result
is stored in a vector of length m.  The algorithm has been parallelized
using POSIX Threads, or Pthreads for short.

   The code was built using the 'gcc' compiler and the name of the
executable is mxv-pthreads.exe.

   The matrix sizes can be set through the '-m' and '-n' options.  The
number of threads is set with the '-t' option.  To increase the duration
of the run, the multiplication is executed repeatedly.
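
   The build command itself is not part of this tutorial.  For
reference, a build along the following lines can be assumed; the exact
list of source files is an assumption and only the use of the '-g'
option and the Pthreads library matters here:

     # Hypothetical build line; the source file names are an assumption.
     $ gcc -O2 -g -o mxv-pthreads.exe *.c -lpthread

   As explained later, the '-g' option preserves the information needed
to map the performance data back to individual source lines.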

   This is an example that multiplies a 3000 by 2000 matrix with a
vector of length 2000 using 2 threads:

     $ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
     mxv: error check passed - rows = 3000 columns = 2000 threads = 2
     $

   The program performs an internal check to verify the results are
correct.  The result of this check is printed, followed by the matrix
sizes and the number of threads used.


File: gprofng.info, Node: A First Profile, Next: The Source Code View, Prev: The Example Program, Up: Getting Started

3.1.2 A First Profile
---------------------

The first step is to collect the performance data.  It is important to
remember that much more information is gathered than may be shown by
default.  Often a single data collection run is sufficient to get a lot
of insight.

   The 'gprofng collect app' command is used for the data collection.
Nothing needs to be changed in the way the application is executed.  The
only difference is that it is now run under control of the tool, as
shown below:

     $ gprofng collect app ./mxv-pthreads.exe -m 3000 -n 2000 -t 1

   This command produces the following output:

     Creating experiment database test.1.er (Process ID: 2416504) ...
     mxv: error check passed - rows = 3000 columns = 2000 threads = 1

   We see the message that a directory with the name 'test.1.er' has
been created.  The application then completes as usual and we have our
first experiment directory that can be analyzed.

   The tool we use for this is called 'gprofng display text'.  It takes
the name of the experiment directory as an argument.

   If invoked this way, the tool starts in the interactive _interpreter_
mode.  While in this environment, commands can be given and the tool
responds.  This is illustrated below:

     $ gprofng display text test.1.er
     Warning: History and command editing is not supported on this system.
     (gp-display-text) quit
     $

   While useful in certain cases, we prefer to use this tool in command
line mode, by specifying the commands to be issued when invoking the
tool.  The way to do this is to prepend each command with a hyphen ('-')
when it is used on the command line.

   For example, with the 'functions' command we request a list of the
functions that have been executed and their respective CPU times:

     $ gprofng display text -functions test.1.er
     Functions sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     2.160     2.160      mxv_core
     0.047     0.103      init_data
     0.030     0.043      erand48_r
     0.013     0.013      __drand48_iterate
     0.013     0.056      drand48
     0.008     0.010      _int_malloc
     0.001     0.001      brk
     0.001     0.002      sysmalloc
     0.        0.001      __default_morecore
     0.        0.113      __libc_start_main
     0.        0.010      allocate_data
     0.        2.160      collector_root
     0.        2.160      driver_mxv
     0.        0.113      main
     0.        0.010      malloc
     0.        0.001      sbrk

   As easy and simple as these steps are, we do have a first profile of
our program!  There are three columns.  The first two contain the _Total
CPU Time_, which is the sum of the user and system time.  *Note
Inclusive and Exclusive Metrics:: for an explanation of "exclusive" and
"inclusive" times.

   The first line echoes the metric that is used to sort the output.  By
default, this is the exclusive CPU time, but the sort metric can be
changed by the user.

   The three columns thus show the exclusive CPU time, the inclusive CPU
time, and the name of the function.

   The function with the name '<Total>' is not a user function, but is
introduced by 'gprofng' and is used to display the accumulated metric
values.  In this case, we see that the total CPU time of this job was
'2.272' seconds.

   With '2.160' seconds, function 'mxv_core' is the most time consuming
function.  It is also a leaf function.

   The next function in the list is 'init_data'.  Although the CPU time
spent in this part is negligible, this is an interesting entry because
the inclusive CPU time of '0.103' seconds is higher than the exclusive
CPU time of '0.047' seconds.  Clearly it is calling another function, or
even more than one function.  *Note The Call Tree:: for details on how
to get more information on this.

   The function 'collector_root' does not look familiar.  It is one of
the internal functions used by 'gprofng collect app' and can be ignored.
While the inclusive time is high, the exclusive time is zero.  This
means it doesn't contribute to the performance.

   The question is how to find out where this function originates from.
There is a very useful command to get more details on a function.  *Note
Information on Load Objects::.


File: gprofng.info, Node: The Source Code View, Next: The Disassembly View, Prev: A First Profile, Up: Getting Started

3.1.3 The Source Code View
--------------------------

In general, you would like to focus the tuning efforts on the most time
consuming part(s) of the program.  In this case that is easy, since
2.160 seconds out of a total of 2.272 seconds is spent in function
'mxv_core'.  That is 95% of the total and it is time to dig deeper and
look at the time distribution at the source code level.

   The 'source' command is used to accomplish this.  It takes the name
of the function, not the source filename, as an argument.  This is
demonstrated below, where the 'gprofng display text' command is used to
show the annotated source listing of function 'mxv_core'.

   Please note that the source code has to be compiled with the '-g'
option in order for the source code feature to work.  Otherwise the
location cannot be determined.

     $ gprofng display text -source mxv_core test.1.er

   The slightly modified output is as follows:

     Source file: <apath>/mxv.c
     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

                             <lines deleted>
                                 <Function: mxv_core>
        0.        0.         32. void __attribute__ ((noinline))
                                 mxv_core (
                                 uint64_t row_index_start,
                                 uint64_t row_index_end,
                                 uint64_t m, uint64_t n,
                                 double **restrict A,
                                 double *restrict b,
                                 double *restrict c)
        0.        0.         33. {
        0.        0.         34.   for (uint64_t i=row_index_start;
                                        i<=row_index_end; i++) {
        0.        0.         35.     double row_sum = 0.0;
     ## 1.687     1.687      36.     for (int64_t j=0; j<n; j++)
        0.473     0.473      37.       row_sum += A[i][j]*b[j];
        0.        0.         38.     c[i] = row_sum;
                             39.   }
        0.        0.         40. }

   The first three lines provide information on the location of the
source file, the object file and the load object (*Note Load Objects and
Functions::).

   Function 'mxv_core' is part of a source file that has other functions
as well.  These functions will be shown, but without timing information.
They have been removed from the output shown above.

   This is followed by the annotated source code listing.  The selected
metrics are shown first, followed by a source line number, and the
source code.  The most time consuming line(s) are marked with the '##'
symbol.  In this way they are easier to find.

   What we see is that all of the time is spent in lines 36-37.

   A related command that sometimes comes in handy as well is called
'lines'.  It displays a list of the source lines and their metrics,
ordered according to the current sort metric (*Note Sorting the
Performance Data::).

   Below are the command and the output.  For layout reasons, only the
top 10 lines are shown here, and the last part of the text on some lines
has been replaced by dots.

     $ gprofng display text -lines test.1.er

     Lines sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     1.687     1.687      mxv_core, line 36 in "mxv.c"
     0.473     0.473      mxv_core, line 37 in "mxv.c"
     0.032     0.088      init_data, line 72 in "manage_data.c"
     0.030     0.043      <Function: erand48_r, instructions without line numbers>
     0.013     0.013      <Function: __drand48_iterate, instructions without ...>
     0.013     0.056      <Function: drand48, instructions without line numbers>
     0.012     0.012      init_data, line 77 in "manage_data.c"
     0.008     0.010      <Function: _int_malloc, instructions without ...>
     0.003     0.003      init_data, line 71 in "manage_data.c"

   What this overview immediately highlights is that the next most time
consuming source line takes 0.032 seconds only.  With an inclusive time
of 0.088 seconds, it is also clear that this branch of the code does not
impact the performance.


File: gprofng.info, Node: The Disassembly View, Next: Display and Define the Metrics, Prev: The Source Code View, Up: Getting Started

3.1.4 The Disassembly View
--------------------------

The source view is very useful to obtain more insight into where the
time is spent, but sometimes this is not sufficient.  This is when the
disassembly view comes in.  It is activated with the 'disasm' command
and, as with the source view, it displays an annotated listing.  In this
case it shows the instructions with the metrics, interleaved with the
source lines.  The instructions have a reference in square brackets ('['
and ']') to the source line they correspond to.

This is what we get for our example:

     $ gprofng display text -disasm mxv_core test.1.er

     Source file: <apath>/mxv.c
     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

                             <lines deleted>
                             32. void __attribute__ ((noinline))
                                 mxv_core (
                                 uint64_t row_index_start,
                                 uint64_t row_index_end,
                                 uint64_t m, uint64_t n,
                                 double **restrict A,
                                 double *restrict b,
                                 double *restrict c)
                             33. {
                                 <Function: mxv_core>
        0.        0.             [33]   4021ba:  mov    0x8(%rsp),%r10
                             34.   for (uint64_t i=row_index_start;
                                        i<=row_index_end; i++) {
        0.        0.             [34]   4021bf:  cmp    %rsi,%rdi
        0.        0.             [34]   4021c2:  jbe    0x37
        0.        0.             [34]   4021c4:  ret
                             35.     double row_sum = 0.0;
                             36.     for (int64_t j=0; j<n; j++)
                             37.       row_sum += A[i][j]*b[j];
        0.        0.             [37]   4021c5:  mov    (%r8,%rdi,8),%rdx
        0.        0.             [36]   4021c9:  mov    $0x0,%eax
        0.        0.             [35]   4021ce:  pxor   %xmm1,%xmm1
        0.002     0.002          [37]   4021d2:  movsd  (%rdx,%rax,8),%xmm0
        0.096     0.096          [37]   4021d7:  mulsd  (%r9,%rax,8),%xmm0
        0.375     0.375          [37]   4021dd:  addsd  %xmm0,%xmm1
     ## 1.683     1.683          [36]   4021e1:  add    $0x1,%rax
        0.004     0.004          [36]   4021e5:  cmp    %rax,%rcx
        0.        0.             [36]   4021e8:  jne    0xffffffffffffffea
                             38.     c[i] = row_sum;
        0.        0.             [38]   4021ea:  movsd  %xmm1,(%r10,%rdi,8)
        0.        0.             [34]   4021f0:  add    $0x1,%rdi
        0.        0.             [34]   4021f4:  cmp    %rdi,%rsi
        0.        0.             [34]   4021f7:  jb     0xd
        0.        0.             [35]   4021f9:  pxor   %xmm1,%xmm1
        0.        0.             [36]   4021fd:  test   %rcx,%rcx
        0.        0.             [36]   402200:  jne    0xffffffffffffffc5
        0.        0.             [36]   402202:  jmp    0xffffffffffffffe8
                             39.   }
                             40. }
        0.        0.             [40]   402204:  ret

   For each instruction, the timing values are given and we can see
exactly which ones are the most expensive.  As with the source level
view, the most expensive instructions are marked with the '##' symbol.

   As illustrated below and similar to the 'lines' command, we can get
an overview of the instructions executed by using the 'pcs' command.

   Below are the command and the output, which again have been
restricted to 10 lines:

     $ gprofng display text -pcs test.1.er

     PCs sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     1.683     1.683      mxv_core + 0x00000027, line 36 in "mxv.c"
     0.375     0.375      mxv_core + 0x00000023, line 37 in "mxv.c"
     0.096     0.096      mxv_core + 0x0000001D, line 37 in "mxv.c"
     0.027     0.027      init_data + 0x000000BD, line 72 in "manage_data.c"
     0.012     0.012      init_data + 0x00000117, line 77 in "manage_data.c"
     0.008     0.008      _int_malloc + 0x00000A45
     0.007     0.007      erand48_r + 0x00000062
     0.006     0.006      drand48 + 0x00000000
     0.005     0.005      __drand48_iterate + 0x00000005


File: gprofng.info, Node: Display and Define the Metrics, Next: A First Customization of the Output, Prev: The Disassembly View, Up: Getting Started

3.1.5 Display and Define the Metrics
------------------------------------

The default metrics shown by 'gprofng display text' are useful, but
there is more recorded than displayed.  We can customize the values
shown by defining the metrics ourselves.

   There are two commands related to changing the metrics shown:
'metric_list' and 'metrics'.

   The first command shows the metrics in use, plus all the metrics that
have been stored as part of the experiment.  The second command may be
used to define the metric list.

   In our example we get the following values for the metrics:

     $ gprofng display text -metric_list test.1.er

     Current metrics: e.totalcpu:i.totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
     Available metrics:
        Exclusive Total CPU Time: e.%totalcpu
        Inclusive Total CPU Time: i.%totalcpu
                            Size: size
                      PC Address: address
                            Name: name

   This shows the metrics currently in use, the metric that is used to
sort the data, and all the metrics that have been recorded, but are not
necessarily shown.

   In this case, the default metrics are set to the exclusive and
inclusive total CPU times, plus the name of the function, or load
object.

   The 'metrics' command is used to define the metrics that need to be
displayed.

   For example, to display the exclusive total CPU time, both as a
number and a percentage, use the following metric definition:
'e.%totalcpu'.

   Since the metrics can be tailored for different views, there is a way
to reset them to the default.  This is done through the special keyword
'default'.
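
   As a small illustration, the sketch below first requests a custom
metric list and then restores the default; the metric string is only an
example and we assume the 'default' keyword is accepted in command line
mode in the same way:

     $ gprofng display text -metrics e.%totalcpu:name -functions test.1.er
     $ gprofng display text -metrics default -functions test.1.er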

File: gprofng.info, Node: A First Customization of the Output, Next: Name the Experiment Directory, Prev: Display and Define the Metrics, Up: Getting Started

3.1.6 A First Customization of the Output
-----------------------------------------

With the information just given, we can customize the function overview.
For the sake of the example, we would like to display the name of the
function first, followed by the exclusive CPU time, given as an absolute
number and a percentage.

   Note that the commands are parsed in order of appearance.  This is
why we need to define the metrics _before_ requesting the function
overview:

     $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er

     Current metrics: name:e.%totalcpu
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Functions sorted by metric: Exclusive Total CPU Time

     Name                Excl. Total
                         CPU
                          sec.      %
     <Total>             2.272 100.00
     mxv_core            2.160  95.04
     init_data           0.047   2.06
     erand48_r           0.030   1.32
     __drand48_iterate   0.013   0.57
     drand48             0.013   0.57
     _int_malloc         0.008   0.35
     brk                 0.001   0.04
     sysmalloc           0.001   0.04
     __default_morecore  0.      0.
     __libc_start_main   0.      0.
     allocate_data       0.      0.
     collector_root      0.      0.
     driver_mxv          0.      0.
     main                0.      0.
     malloc              0.      0.
     sbrk                0.      0.

   This was a first and simple example of how to customize the output.
Note that we did not rerun our profiling job and merely modified the
display settings.  Below we will show other and also more advanced
examples of customization.


File: gprofng.info, Node: Name the Experiment Directory, Next: Control the Number of Lines in the Output, Prev: A First Customization of the Output, Up: Getting Started

3.1.7 Name the Experiment Directory
-----------------------------------

When using 'gprofng collect app', the default names for experiments work
fine, but they are quite generic.  It is often more convenient to select
a more descriptive name.  For example, one that reflects the conditions
for the experiment conducted.

   For this, the mutually exclusive '-o' and '-O' options come in handy.
Both may be used to provide a name for the experiment directory, but the
behaviour of 'gprofng collect app' is different.

   With the '-o' option, an existing experiment directory is not
overwritten.  You either need to explicitly remove an existing directory
first, or use a name that is not in use yet.

   This is in contrast with the behaviour for the '-O' option.  Any
existing (experiment) directory with the same name is silently
overwritten.

   Be aware that the name of the experiment directory has to end with
'.er'.
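
   A short sketch of the difference between the two options is shown
below; the directory name is just an example:

     # With -o, an existing run1.er is not overwritten; remove it first
     # or pick a name that is not in use yet.
     $ gprofng collect app -o run1.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1

     # With -O, an existing run1.er is silently overwritten.
     $ gprofng collect app -O run1.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1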

File: gprofng.info, Node: Control the Number of Lines in the Output, Next: Sorting the Performance Data, Prev: Name the Experiment Directory, Up: Getting Started

3.1.8 Control the Number of Lines in the Output
-----------------------------------------------

The 'limit <n>' command can be used to control the number of lines
printed in various overviews, including the function view, but it also
takes effect for other display commands, like 'lines'.

   The argument '<n>' should be a positive integer and sets the number
of lines shown.  A value of zero resets the limit to the default.

   Be aware that the pseudo-function '<Total>' counts as a regular
function.  For example, 'limit 10' displays nine user level functions.


File: gprofng.info, Node: Sorting the Performance Data, Next: Scripting, Prev: Control the Number of Lines in the Output, Up: Getting Started

3.1.9 Sorting the Performance Data
----------------------------------

The 'sort <key>' command sets the key to be used when sorting the
performance data.

   The key is a valid metric definition, but the visibility field (*Note
Metric Definitions::) in the metric definition is ignored since this
does not affect the outcome of the sorting operation.  For example, if
we set the sort key to 'e.totalcpu', the values will be sorted in
descending order with respect to the exclusive total CPU time.

   The data can be sorted in reverse order by prepending the metric
definition with a minus ('-') sign.  For example, 'sort -e.totalcpu'.

   A default metric for the sort operation has been defined and since
this is a persistent command, this default can be restored with
'default' as the key.
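
   As a short illustration of the last two commands in command line
mode, the following sketch limits the output to 10 lines and sorts the
function view by the inclusive total CPU time; the limit and the sort
key are only examples:

     $ gprofng display text -limit 10 -sort i.totalcpu -functions test.1.er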

File: gprofng.info, Node: Scripting, Next: A More Elaborate Example, Prev: Sorting the Performance Data, Up: Getting Started

3.1.10 Scripting
----------------

As is probably clear by now, the list with commands for 'gprofng display
text' can be very long.  This is tedious and also error prone.  Luckily,
there is an easier and more elegant way to control the behaviour of this
tool.

   Through the 'script' command, the name of a file with commands can be
passed in.  These commands are parsed and executed as if they appeared
on the command line in the same order as encountered in the file.  The
commands in this script file can actually be mixed with commands on the
command line.

   The difference between the commands in the script file and those used
on the command line is that the latter require a leading dash ('-')
symbol.

   Comment lines are supported.  They need to start with the '#' symbol.
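
   For example, script commands and command line commands can be
combined in a single invocation; the script name below is hypothetical:

     $ gprofng display text -limit 5 -script my-commands test.1.er

   The 'limit' command is processed first, followed by the commands in
the file 'my-commands', in the order in which they appear in that file.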

File: gprofng.info, Node: A More Elaborate Example, Next: The Call Tree, Prev: Scripting, Up: Getting Started

3.1.11 A More Elaborate Example
-------------------------------

With the information presented so far, we can customize our data
gathering and display commands.

   As an example, to reflect the name of the algorithm and the number of
threads that were used in the experiment, we select 'mxv.1.thr.er' as
the name of the experiment directory.  All we then need to do is to add
the '-O' option followed by this name on the command line when running
'gprofng collect app':

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1

   The commands to generate the profile are put into a file that we
simply call 'my-script':

     $ cat my-script
     # This is my first gprofng script
     # Set the metrics
     metrics i.%totalcpu:e.%totalcpu:name
     # Use the exclusive time to sort
     sort e.totalcpu
     # Limit the function list to 5 lines
     limit 5
     # Show the function list
     functions

   This script file is then specified as input to the 'gprofng display
text' command that is used to display the performance information stored
in 'mxv.1.thr.er':

     $ gprofng display text -script my-script mxv.1.thr.er

   The command above produces the following output:

     # This is my first gprofng script
     # Set the metrics
     Current metrics: i.%totalcpu:e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Use the exclusive time to sort
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the function list to 5 lines
     Print limit set to 5
     # Show the function list
     Functions sorted by metric: Exclusive Total CPU Time

     Incl. Total   Excl. Total    Name
         CPU           CPU
      sec.      %   sec.      %
     2.272 100.00   2.272 100.00  <Total>
     2.159  95.00   2.159  95.00  mxv_core
     0.102   4.48   0.054   2.37  init_data
     0.035   1.54   0.025   1.10  erand48_r
     0.048   2.11   0.013   0.57  drand48

   In the first part of the output, our comment lines in the script file
are shown.  These are interleaved with an acknowledgement message for
the commands.

   This is followed by a profile consisting of 5 lines only.  For both
metrics, the percentages plus the timings are given.  The numbers are
sorted with respect to the exclusive total CPU time.

   It is now immediately clear that function 'mxv_core' is responsible
for 95% of the CPU time and 'init_data' takes only 4.5%.

   This is also where we see sampling in action.  Although this is
exactly the same job we profiled before, the timings are somewhat
different, but the differences are very small.

File: gprofng.info, Node: The Call Tree, Next: More Information on the Experiment, Prev: A More Elaborate Example, Up: Getting Started

3.1.12 The Call Tree
--------------------

The call tree shows the dynamic hierarchy of the application by
displaying the functions executed and their parent.  It helps to find
the most expensive path in the program.

   This feature is enabled through the 'calltree' command.  This is how
to get this tree for our current experiment:

     $ gprofng display text -calltree mxv.1.thr.er

   This displays the following structure:

     Functions Call Tree. Metric: Attributed Total CPU Time

     Attr.      Name
     Total
     CPU sec.
     2.272      +-<Total>
     2.159        +-collector_root
     2.159        |  +-driver_mxv
     2.159        |    +-mxv_core
     0.114        +-__libc_start_main
     0.114          +-main
     0.102            +-init_data
     0.048            |  +-drand48
     0.035            |  +-erand48_r
     0.010            |    +-__drand48_iterate
     0.011            +-allocate_data
     0.011            |  +-malloc
     0.011            |    +-_int_malloc
     0.001            |      +-sysmalloc
     0.001            +-check_results
     0.001              +-malloc
     0.001                +-_int_malloc

   At first sight this may not be what you expected and some explanation
is in order.

   First of all, function 'collector_root' is internal to 'gprofng' and
should be hidden from the user.  This is part of a planned future
enhancement.

   Recall that the 'objects' and 'fsingle' commands are very useful to
find out more about load objects in general, but also to help identify
an unknown entry in the function overview.  *Note Load Objects and
Functions::.

   Another thing to note is that there are two main branches.  The one
under 'collector_root' and the second one under '__libc_start_main'.
This reflects the fact that we are executing a parallel program.  Even
though we only used one thread for this run, this is still executed in a
separate path.

   The main, sequential part of the program is displayed under 'main'
and shows the functions called and the time they took.

   There are two things worth noting for the call tree feature:

   * This is a dynamic tree and since sampling is used, it most likely
     looks slightly different across seemingly identical profile runs.
     In case the run times are short, it is worth considering using a
     higher resolution through the '-p' option.  For example, use '-p
     hi' to increase the sampling rate.

   * In case hardware event counters have been enabled (*Note Profile
     Hardware Event Counters::), these values are also displayed in the
     call tree view.


File: gprofng.info,  Node: More Information on the Experiment,  Next: Control the Sampling Frequency,  Prev: The Call Tree,  Up: Getting Started

3.1.13 More Information on the Experiment
-----------------------------------------

The experiment directory does not only contain performance related
data.  Several system characteristics, the actual command executed, and
some global performance statistics can be displayed as well.

   The 'header' command displays information about the experiment(s).
For example, this is the command to extract this data from our
experiment directory:

     $ gprofng display text -header mxv.1.thr.er

   The above command prints the following information.  Note that some
of the lay-out and the information have been modified.  The textual
changes are marked with the '<' and '>' symbols.

     Experiment: mxv.1.thr.er
       No errors
       No warnings
       Archive command `gp-archive -n -a on
                        --outfile <exp_dir>/archive.log <exp_dir>'

       Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
       Process pid 30591, ppid 30589, pgrp 30551, sid 30468
       Current working directory: <cwd>
       Collector version: `2.36.50'; experiment version 12.4 (64-bit)
       Host `<hostname>', OS `Linux <version>', page size 4096,
            architecture `x86_64'
         16 CPUs, clock speed 1995 MHz.
         Memory: 30871514 pages @  4096 = 120591 MB.
       Data collection parameters:
         Clock-profiling, interval = 997 microsecs.
         Periodic sampling, 1 secs.
         Follow descendant processes from: fork|exec|combo

       Experiment started <date and time>

       Experiment Ended: 2.293162658
       Data Collection Duration: 2.293162658

   The output above may assist in troubleshooting, or to verify some of
the operational conditions, and we recommend including this command
when generating a profile.

   Related to this command there is a useful option to record your own
comment(s) in an experiment.  To this end, use the '-C' option on the
'gprofng collect app' tool to specify a comment string.  Up to ten
comment lines can be included.  These comments are displayed with the
'header' command on the 'gprofng display text' tool.
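
   As an illustration, and re-using the settings from before, the
command below stores such a comment when the experiment is created.
The text of the comment is of course only an example:

     $ gprofng collect app -O mxv.1.thr.er -C "baseline, single thread" ./$exe -m $m -n $n -t 1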

   The 'overview' command displays information on the experiment(s) and
also shows a summary of the values for the metric(s) used.  This is an
example of how to use it on our newly created experiment directory:

     $ gprofng display text -overview mxv.1.thr.er

     Experiment(s):

     Experiment     :mxv.1.thr.er
       Target       : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
       Host         : <hostname> (<ISA>, Linux <version>)
       Start Time   : <date and time>
       Duration     : 2.293 Seconds

     Metrics:

       Experiment Duration (Seconds): [2.293]
       Clock Profiling
         [X]Total CPU Time - totalcpu (Seconds): [*2.272]

     Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
            metrics.
            The metrics command can be used to change selections.  The
            metric_list command lists all available metrics.

   This command provides a dashboard overview that helps to easily
identify where the time is spent, and in case hardware event counters
are used, it shows their total values.


File: gprofng.info,  Node: Control the Sampling Frequency,  Next: Information on Load Objects,  Prev: More Information on the Experiment,  Up: Getting Started

3.1.14 Control the Sampling Frequency
-------------------------------------

So far we have not talked about the frequency of the sampling process,
but in some cases it is useful to change the default of 10
milliseconds.

   The advantage of increasing the sampling frequency is that functions
that do not take much time per invocation are more accurately captured.
The downside is that more data is gathered.  This has an impact on the
overhead of the collection process and more disk space is required.

   In general this is not an immediate concern, but with heavily
threaded applications that run for an extended period of time,
increasing the frequency may have a more noticeable impact.

   The '-p' option on the 'gprofng collect app' tool is used to enable
or disable clock based profiling, or to explicitly set the sampling
rate.  This option takes one of the following keywords:

'off'
     Disable clock based profiling.

'on'
     Enable clock based profiling with a per thread sampling interval
     of 10 ms.
     This is the default.

'lo'
     Enable clock based profiling with a per thread sampling interval
     of 100 ms.

'hi'
     Enable clock based profiling with a per thread sampling interval
     of 1 ms.

'<value>'
     Enable clock based profiling with a per thread sampling interval
     of <value>.

   One may wonder why there is an option to disable clock based
profiling.  This is because by default, it is enabled when conducting
hardware event counter experiments (*Note Profile Hardware Event
Counters::).  With the '-p off' option, this can be disabled.

   If an explicit value is set for the sampling interval, the number
can be an integer or a floating-point number.  A suffix of 'u' for
microseconds, or 'm' for milliseconds is supported.  If no suffix is
used, the value is assumed to be in milliseconds.

   If the value is smaller than the clock profiling minimum, a warning
message is issued and it is set to the minimum.  In case it is not a
multiple of the clock profiling resolution, it is silently rounded down
to the nearest multiple of the clock resolution.

   If the value exceeds the clock profiling maximum, is negative, or
zero, an error is reported.

   Note that the 'header' command echoes the sampling rate used.
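
   For example, the hypothetical command below selects a sampling
interval of 5 milliseconds; the name of the experiment directory is our
choice:

     $ gprofng collect app -p 5m -O mxv.5ms.er ./$exe -m $m -n $n -t 1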


File: gprofng.info,  Node: Information on Load Objects,  Prev: Control the Sampling Frequency,  Up: Getting Started

3.1.15 Information on Load Objects
----------------------------------

It may happen that the function list contains a function that is not
known to the user.  This can easily happen with library functions, for
example.  Luckily, there are three commands that come in handy then.

   These commands are 'objects', 'fsingle', and 'fsummary'.  They
provide details on load objects (*Note Load Objects and Functions::).

   The 'objects' command lists all load objects that have been
referenced during the performance experiment.  Below we show the
command and the result for our profile job.  Like before, the (long)
path names in the output have been shortened and replaced by the
'<apath>' symbol that represents an absolute directory path.

     $ gprofng display text -objects mxv.1.thr.er

   The output includes the name and path of the target executable:

     <Unknown> (<Unknown>)
     <mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
     <librt-2.17.so> (/usr/lib64/librt-2.17.so)
     <libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
     <libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
     <libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
     <libc-2.17.so> (/usr/lib64/libc-2.17.so)
     <libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
     <libm-2.17.so> (/usr/lib64/libm-2.17.so)
     <libgp-collector.so> (<apath>/libgp-collector.so)
     <ld-2.17.so> (/usr/lib64/ld-2.17.so)
     <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)

   The 'fsingle' command may be used to get more details on a specific
entry in the function view.  For example, the command below provides
additional information on the 'collector_root' function shown in the
function overview.

     $ gprofng display text -fsingle collector_root mxv.1.thr.er

   Below is the output from this command.  It has been somewhat
modified to match the display requirements.

     collector_root
             Exclusive Total CPU Time:    0.    (  0. %)
             Inclusive Total CPU Time:    2.159 ( 95.0%)
                                 Size:  401
                           PC Address: 10:0x0001db60
                          Source File: <apath>/dispatcher.c
                          Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
                          Load Object: <apath>/libgp-collector.so
                         Mangled Name:
                              Aliases:

   In this table we not only see how much time was spent in this
function, we also see where it originates from.  In addition to this,
the size and start address are given as well.  If the source code
location is known, it is also shown here.

   The related 'fsummary' command displays the same information as
'fsingle', but for all functions in the function overview, including
'<Total>':

     $ gprofng display text -fsummary mxv.1.thr.er

     Functions sorted by metric: Exclusive Total CPU Time

     <Total>
             Exclusive Total CPU Time:    2.272 (100.0%)
             Inclusive Total CPU Time:    2.272 (100.0%)
                                 Size:    0
                           PC Address: 1:0x00000000
                          Source File: (unknown)
                          Object File: (unknown)
                          Load Object: <Total>
                         Mangled Name:
                              Aliases:

     mxv_core
             Exclusive Total CPU Time:    2.159 ( 95.0%)
             Inclusive Total CPU Time:    2.159 ( 95.0%)
                                 Size:   75
                           PC Address: 2:0x000021ba
                          Source File: <apath>/mxv.c
                          Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
                          Load Object: <apath>/mxv-pthreads.exe
                         Mangled Name:
                              Aliases:

      ... etc ...


File: gprofng.info,  Node: Support for Multithreading,  Next: Viewing Multiple Experiments,  Prev: Getting Started,  Up: A Mini Tutorial

3.2 Support for Multithreading
==============================

In this chapter we introduce and discuss the support for
multithreading.  As is shown below, nothing needs to be changed when
collecting the performance data.

   The difference is that additional commands are available to get more
information on the parallel environment, and several filters allow the
user to zoom in on specific threads.

* Menu:

* Creating a Multithreading Experiment::
* Commands Specific to Multithreading::


File: gprofng.info,  Node: Creating a Multithreading Experiment,  Next: Commands Specific to Multithreading,  Up: Support for Multithreading

3.2.1 Creating a Multithreading Experiment
------------------------------------------

We demonstrate the support for multithreading using the same code and
settings as before, but this time we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2

   First of all, note that we did not change anything, other than
setting the number of threads to 2.
Nothing special is needed to profile a multithreaded job when using
'gprofng'.

   The same is true when displaying the performance results.  The same
commands that we used before work unmodified.  For example, this is all
that is needed to get a function overview:

     $ gprofng display text -limit 10 -functions mxv.2.thr.er

   This produces the following familiar looking output:

     Print limit set to 10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total   Incl. Total    Name
     CPU sec.      CPU sec.
     2.268         2.268          <Total>
     2.155         2.155          mxv_core
     0.044         0.103          init_data
     0.030         0.046          erand48_r
     0.016         0.016          __drand48_iterate
     0.013         0.059          drand48
     0.008         0.011          _int_malloc
     0.003         0.003          brk
     0.            0.003          __default_morecore
     0.            0.114          __libc_start_main


File: gprofng.info,  Node: Commands Specific to Multithreading,  Prev: Creating a Multithreading Experiment,  Up: Support for Multithreading

3.2.2 Commands Specific to Multithreading
-----------------------------------------

The function overview shown above shows the results aggregated over all
the threads.  The interesting new element is that we can also look at
the performance data for the individual threads.

   The 'thread_list' command displays how many threads have been used:

     $ gprofng display text -thread_list mxv.2.thr.er

   This produces the following output, showing that three threads have
been used:

     Exp Sel Total
     === === =====
       1 all     3

   The output confirms there is one experiment and that by default all
threads are selected.

   It may seem surprising to see three threads here, since we used the
'-t 2' option, but it is common for a Pthreads program to use one
additional thread.  This is typically the thread that runs from start
to finish and handles the sequential portions of the code, as well as
taking care of managing the other threads.

   It is no different in our example code.  At some point, the main
thread creates and activates the two threads that perform the
multiplication of the matrix with the vector.  Upon completion of this
computation, the main thread continues.

   The 'threads' command is simple, yet very powerful.
It shows the total value of the metrics for each thread.  To make it
easier to interpret the data, we modify the metrics to include
percentages:

     $ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er

   The command above produces the following overview:

     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Objects sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     2.258 100.00   <Total>
     1.075  47.59   Process 1, Thread 3
     1.070  47.37   Process 1, Thread 2
     0.114   5.03   Process 1, Thread 1

   The first line gives the total CPU time accumulated over the threads
selected.  This is followed by the metric value(s) for each thread.

   From this it is clear that the main thread is responsible for 5% of
the total CPU time, while the other two threads take 47% each.

   This view is ideally suited to verify if there are any load
balancing issues and also to find the most time consuming thread(s).

   While useful, often more information than this is needed.  This is
where the thread selection filter comes in.  Through the
'thread_select' command, one or more threads may be selected (*Note The
Selection List:: for how to define the selection list).

   Since it is most common to use this command in a script, we do so as
well here.  Below is the script we are using:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the function overview for thread 1
     thread_select 1
     functions
     # Get the function overview for thread 2
     thread_select 2
     functions
     # Get the function overview for thread 3
     thread_select 3
     functions

   The definition of the metrics and the output limiter has been shown
and explained before, so we will not discuss them again.  The new
command we focus on is 'thread_select'.

   This command takes a list (*Note The Selection List::) to select
specific threads.  In this case we simply use the individual thread
numbers that we obtained with the 'thread_list' command earlier.

   This restricts the output of the 'functions' command to the thread
number(s) specified.  This means that the script above shows which
function(s) each thread executes and how much CPU time they consumed.
Both the timings and their percentages are given.
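
   Assuming that the script above is stored in a file that we call
'my-script-threads' (the file name is our choice), this is how it is
executed:

     $ gprofng display text -script my-script-threads mxv.2.thr.er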

   This is the relevant part of the output for the first thread:

     # Get the function overview for thread 1
     Exp Sel Total
     === === =====
       1 1       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.114 100.00   <Total>
     0.051  44.74   init_data
     0.028  24.56   erand48_r
     0.017  14.91   __drand48_iterate
     0.010   8.77   _int_malloc
     0.008   7.02   drand48
     0.      0.     __libc_start_main
     0.      0.     allocate_data
     0.      0.     main
     0.      0.     malloc

   As usual, the comment lines are echoed.  This is followed by a
confirmation of our selection.  We see that indeed thread 1 has been
selected.  What is displayed next is the function overview for this
particular thread.  Due to the 'limit 10' command, there are ten
entries in this list.

   Below are the overviews for threads 2 and 3 respectively.  We see
that all of the CPU time is spent in function 'mxv_core' and that this
time is approximately the same for both threads.

     # Get the function overview for thread 2
     Exp Sel Total
     === === =====
       1 2       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     1.072 100.00   <Total>
     1.072 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for thread 3
     Exp Sel Total
     === === =====
       1 3       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     1.076 100.00   <Total>
     1.076 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   When analyzing the performance of a multithreaded application, it is
sometimes useful to know whether threads have mostly executed on the
same core, say, or if they have wandered across multiple cores.  This
sort of stickiness is usually referred to as _thread affinity_.

   Similar to the commands for the threads, there are several commands
related to the usage of the cores, or _CPUs_ as they are called in
'gprofng' (*Note The Concept of a CPU in gprofng::).

   In order to have some more interesting data to look at, we created a
new experiment, this time using 8 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8

   Similar to the 'thread_list' command, the 'cpu_list' command
displays how many CPUs have been used.  The equivalent of the 'threads'
command is the 'cpus' command, which shows the CPU numbers that were
used and how much time was spent on each of them.  Both are
demonstrated below.

     $ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er

   This command produces the following output:

     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Exp Sel Total
     === === =====
       1 all    10
     Objects sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     2.310 100.00   <Total>
     0.286  12.39   CPU 7
     0.284  12.30   CPU 13
     0.282  12.21   CPU 5
     0.280  12.13   CPU 14
     0.266  11.52   CPU 9
     0.265  11.48   CPU 2
     0.264  11.44   CPU 11
     0.194   8.42   CPU 0
     0.114   4.92   CPU 1
     0.074   3.19   CPU 15

   What we see in this table is that a total of 10 CPUs have been used.
This is followed by a list with all the CPU numbers that have been used
during the run.  For each CPU it is shown how much time was spent on
it.

   While the table with thread times shown earlier may point at a load
imbalance in the application, this overview has a different purpose.

   For example, we see that 10 CPUs have been used, but we know that
the application uses only 9 threads.  This means that at least one
thread has executed on more than one CPU.  In itself this is not
something to worry about, but it warrants a deeper investigation.

   Honesty dictates that we first performed a pre-analysis to find out
which thread(s) had been running on more than one CPU.  We found this
to be thread 7.  It executed on CPUs 0 and 15.

   With this knowledge, we wrote the script shown below.  It zooms in
on the behaviour of thread 7.

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     functions
     # Get the function overview for CPU 0
     cpu_select 0
     functions
     # Get the function overview for CPU 15
     cpu_select 15
     functions

   From the threads overview shown earlier, we know that thread 7 used
'0.268' seconds of CPU time.

   By selecting CPUs 0 and 15, respectively, we get the following
function overviews:

     # Get the function overview for CPU 0
     Exp Sel Total
     === === =====
       1 0      10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.194 100.00   <Total>
     0.194 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for CPU 15
     Exp Sel Total
     === === =====
       1 15     10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.074 100.00   <Total>
     0.074 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   This shows that thread 7 spent '0.194' seconds on CPU 0 and '0.074'
seconds on CPU 15.


File: gprofng.info,  Node: Viewing Multiple Experiments,  Next: Profile Hardware Event Counters,  Prev: Support for Multithreading,  Up: A Mini Tutorial

3.3 Viewing Multiple Experiments
================================

One thing we did not cover so far is that 'gprofng' fully supports the
analysis of multiple experiments.  The 'gprofng display text' tool
accepts a list of experiments.  The data can either be aggregated
across the experiments, or used in a comparison.

   The 'experiment_list' command, demonstrated below, shows which
experiments have been loaded.

* Menu:

* Aggregation of Experiments::
* Comparison of Experiments::


File: gprofng.info,  Node: Aggregation of Experiments,  Next: Comparison of Experiments,  Up: Viewing Multiple Experiments

3.3.1 Aggregation of Experiments
--------------------------------

By default, the data for multiple experiments is aggregated and the
display commands show these combined results.
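
   As a minimal illustration, and re-using the experiments created
earlier, simply passing both experiment directories to the 'functions'
command prints the combined function overview:

     $ gprofng display text -limit 5 -functions mxv.1.thr.er mxv.2.thr.er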

   For example, we can aggregate the data for our single and dual
thread experiments.  Below is the script we used for this:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the list with experiments
     experiment_list
     # Get the function overview
     functions

   With the exception of the 'experiment_list' command, all commands
used have been discussed earlier.

   The 'experiment_list' command provides a list of the experiments
that have been loaded.  This is used to verify that we are looking at
the experiments we intend to aggregate.

   Assuming that this script is stored in a file called
'my-script-agg', this is how we use it:

     $ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er

   With the command above, we get the following output:

     # Define the metrics
     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the output to 10 lines
     Print limit set to 10
     # Get the list with experiments
     ID Sel   PID Experiment
     == === ===== ============
      1 yes 30591 mxv.1.thr.er
      2 yes 11629 mxv.2.thr.er
     # Get the function overview
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     4.533 100.00   <Total>
     4.306  94.99   mxv_core
     0.105   2.31   init_data
     0.053   1.17   erand48_r
     0.027   0.59   __drand48_iterate
     0.021   0.46   _int_malloc
     0.021   0.46   drand48
     0.001   0.02   sysmalloc
     0.      0.     __libc_start_main
     0.      0.     allocate_data

   The first five lines should look familiar.  The five lines that
follow echo the comment line in the script and show the overview of the
experiments.  This confirms that two experiments have been loaded and
that both are active.

   This is followed by the function overview.  The timings have been
summed up and the percentages are adjusted accordingly.  For example,
the total accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.


File: gprofng.info,  Node: Comparison of Experiments,  Prev: Aggregation of Experiments,  Up: Viewing Multiple Experiments

3.3.2 Comparison of Experiments
-------------------------------

The support for multiple experiments really shines in comparison mode.
This feature is enabled through the command 'compare on' and is
disabled by setting 'compare off'.

   In comparison mode, the data for the various experiments is shown
side by side, as illustrated below where we compare the results for the
multithreaded experiments using one and two threads respectively:

     $ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er

This produces the following output:

     Functions sorted by metric: Exclusive Total CPU Time

     mxv.1.thr.er  mxv.2.thr.er  mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total   Incl. Total   Incl. Total    Name
     CPU           CPU           CPU           CPU
      sec.          sec.          sec.          sec.
     2.272         2.261         2.272         2.261          <Total>
     2.159         2.148         2.159         2.148          mxv_core
     0.054         0.051         0.102         0.104          init_data
     0.025         0.028         0.035         0.045          erand48_r
     0.013         0.008         0.048         0.053          drand48
     0.011         0.010         0.012         0.010          _int_malloc
     0.010         0.017         0.010         0.017          __drand48_iterate
     0.001         0.            0.001         0.             sysmalloc
     0.            0.            0.114         0.114          __libc_start_main
     0.            0.            0.011         0.010          allocate_data
     0.            0.            0.001         0.             check_results
     0.            0.            2.159         2.148          collector_root
     0.            0.            2.159         2.148          driver_mxv
     0.            0.            0.114         0.114          main
     0.            0.            0.012         0.010          malloc

   This table is already helpful to more easily compare (two) profiles,
but there is more that we can do here.

   By default, in comparison mode, all measured values are shown.
Often profiling is about comparing performance data.  It is therefore
more useful to look at differences, or ratios, using one experiment as
a reference.

   The values shown are then relative to this reference.  For example,
if a ratio is below one, it means the reference value was higher.

   This feature is supported on the 'compare' command.  In addition to
'on', or 'off', this command also supports 'delta', or 'ratio'.

   Usage of one of these two keywords enables the comparison feature
and shows either the difference, or the ratio, relative to the
reference data.

   In the example below, we use the same two experiments used in the
comparison above, but as before, the number of lines is restricted to
10 and we focus on the exclusive timings plus percentages.  For the
comparison part we are interested in the differences.

   This is the script that produces such an overview:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Set the comparison mode to differences
     compare delta
     # Get the function overview
     functions

   Assuming this script file is called 'my-script-comp', this is how we
get the table displayed on our screen:

     $ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er

   Leaving out some of the lines printed, which we have seen before, we
get the following table:

     mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total     Name
     CPU           CPU
      sec.      %   delta      %
     2.272 100.00  -0.011 100.00   <Total>
     2.159  95.00  -0.011  94.97   mxv_core
     0.054   2.37  -0.003   2.25   init_data
     0.025   1.10  +0.003   1.23   erand48_r
     0.013   0.57  -0.005   0.35   drand48
     0.011   0.48  -0.001   0.44   _int_malloc
     0.010   0.44  +0.007   0.75   __drand48_iterate
     0.001   0.04  -0.001   0.     sysmalloc
     0.      0.    +0.      0.     __libc_start_main
     0.      0.    +0.      0.     allocate_data

   It is now easy to see that the CPU times for the most time consuming
functions in this code are practically the same.

   While in this case we used the delta as a comparison, the ratio can
be selected in exactly the same way.

   Note that the comparison feature is supported at the function,
source, and disassembly level.  There is no practical limit on the
number of experiments that can be used in a comparison.
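
   For example, and assuming the same two experiments, this is a sketch
of how to request the ratios directly from the command line instead of
through a script:

     $ gprofng display text -compare ratio -functions mxv.1.thr.er mxv.2.thr.er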


File: gprofng.info,  Node: Profile Hardware Event Counters,  Next: Java Profiling,  Prev: Viewing Multiple Experiments,  Up: A Mini Tutorial

3.4 Profile Hardware Event Counters
===================================

Many processors provide a set of hardware event counters and 'gprofng'
provides support for this feature.  *Note Hardware Event Counters
Explained:: for those readers that are not familiar with such counters
and would like to learn more.

   In this section we explain how to get the details on the event
counter support for the processor used in the experiment(s), and show
several examples.

* Menu:

* Getting Information on the Counters Supported::
* Examples Using Hardware Event Counters::


File: gprofng.info,  Node: Getting Information on the Counters Supported,  Next: Examples Using Hardware Event Counters,  Up: Profile Hardware Event Counters

3.4.1 Getting Information on the Counters Supported
---------------------------------------------------

The first step is to check if the processor used for the experiments is
supported by 'gprofng'.

   The '-h' option on 'gprofng collect app' will show the event counter
information:

     $ gprofng collect app -h

   In case the counters are supported, a list with the events is
printed.  Otherwise, a warning message will be issued.

   For example, below we show this command and the output on an Intel
Xeon Platinum 8167M (aka "Skylake") processor.  The output has been
split into several sections and each section is commented upon
separately.

     Run "gprofng collect app --help" for a usage message.

     Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
     (cpuver=2499):

       -h {auto|lo|on|hi}
          turn on default set of HW counters at the specified rate
       -h <ctr_def> [-h <ctr_def>]...
       -h <ctr_def>[,<ctr_def>]...
          specify HW counter profiling for up to 4 HW counters

   The first line shows how to get a usage overview.  This is followed
by some information on the target processor.

   The next five lines explain in what ways the '-h' option can be used
to define the events to be monitored.

   The first version shown above enables a default set of counters.
This default depends on the processor this command is executed on.  The
keyword following the '-h' option defines the sampling rate:

'auto'
     Match the sample rate used by clock profiling.  If the latter is
     disabled, use a per thread sampling rate of approximately 100
     samples per second.  This setting is the default and preferred.

'on'
     Use a per thread sampling rate of approximately 100 samples per
     second.

'lo'
     Use a per thread sampling rate of approximately 10 samples per
     second.

'hi'
     Use a per thread sampling rate of approximately 1000 samples per
     second.
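
   As a hypothetical illustration of this first variant, the command
below enables the default set of counters at the high sampling rate;
the experiment name and program settings are our choice:

     $ gprofng collect app -O mxv.hwc.hi.er -h hi ./$exe -m $m -n $n -t 1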

   The second and third variant define the events to be monitored.
Note that the number of simultaneous events supported is printed.  In
this case we can monitor four events in a single profiling job.

   It is a matter of preference whether you like to use the '-h' option
for each event, or use it once, followed by a comma separated list.

   There is one slight catch though.  The counter definition below has
a mandatory comma (',') between the event and the rate.  While a
default can be used for the rate, the comma cannot be omitted.  This
may result in a somewhat awkward counter definition in case the default
sampling rate is used.

   For example, the following two commands are equivalent.  Note the
double comma in the second command.  This is not a typo.

     $ gprofng collect app -h cycles -h insts ...
     $ gprofng collect app -h cycles,,insts ...

   In the first command this comma is not needed, because a comma (',')
immediately followed by white space may be omitted.

   This is why we prefer this syntax and in the remainder we will use
the first version of this command.

   The counter definition takes an event name, plus optionally one or
more attributes, followed by a comma, and optionally the sampling rate.
The output section below shows the formal definition.

     <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]

   The printed help then explains this syntax.  Below we have
summarized and expanded this output:

'<ctr>'
     The counter name must be selected from the available counters
     listed as part of the output printed with the '-h' option.  On
     most systems, if a counter is not listed, it may still be
     specified by its numeric value.

'~<attr>=<val>'
     This is an optional attribute that depends on the processor.  The
     list of supported attributes is printed in the output.  Examples
     of attributes are "user", or "system".  The value can be given in
     decimal or hexadecimal format.  Multiple attributes may be
     specified, and each must be preceded by a ~.

'<rate>'

     The sampling rate is one of the following:

     'auto'
          This is the default and matches the rate used by clock
          profiling.  If clock profiling is disabled, use 'on'.

     'on'
          Set the per thread maximum sampling rate to ~100
          samples/second

     'lo'
          Set the per thread maximum sampling rate to ~10
          samples/second

     'hi'
          Set the per thread maximum sampling rate to ~1000
          samples/second

     '<interval>'
          Define the sampling interval.  *Note Control the Sampling
          Frequency:: for how to define this.

   After the section with the formal definition of events and counters,
a processor specific list is displayed.  This part starts with an
overview of the default set of counters and the aliased names supported
_on this specific processor_.

     Default set of HW counters:

         -h cycles,,insts,,llm

     Aliases for most useful HW counters:

      alias    raw name                      type   units regs description

      cycles   unhalted-core-cycles   CPU-cycles    0123 CPU Cycles
      insts    instruction-retired        events    0123 Instructions Executed
      llm      llc-misses                 events    0123 Last-Level Cache Misses
      br_msp   branch-misses-retired      events    0123 Branch Mispredict
      br_ins   branch-instruction-retired events    0123 Branch Instructions

   The definitions given above may or may not be available on other
processors, but we try to maximize the overlap across alias sets.

   The table above shows the default set of counters defined for this
processor, and the aliases.  For each alias the full "raw" name is
given, plus the unit of the number returned by the counter (CPU cycles,
or a raw count), the hardware counter register(s) the event is allowed
to be mapped onto, and a short description.
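
   With the alias names from this table and the comma rule shown
earlier, the default set listed above can thus also be written as
follows:

     -h cycles -h insts -h llm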

   The last part of the output contains all the events that can be
monitored:

     Raw HW counters:

      name                                    type   units regs description

      unhalted-core-cycles               CPU-cycles    0123
      unhalted-reference-cycles              events    0123
      instruction-retired                    events    0123
      llc-reference                          events    0123
      llc-misses                             events    0123
      branch-instruction-retired             events    0123
      branch-misses-retired                  events    0123
      ld_blocks.store_forward                events    0123
      ld_blocks.no_sr                        events    0123
      ld_blocks_partial.address_alias        events    0123
      dtlb_load_misses.miss_causes_a_walk    events    0123
      dtlb_load_misses.walk_completed_4k     events    0123

      <many lines deleted>

      l2_lines_out.silent                    events    0123
      l2_lines_out.non_silent                events    0123
      l2_lines_out.useless_hwpf              events    0123
      sq_misc.split_lock                     events    0123

      See Chapter 19 of the "Intel 64 and IA-32 Architectures Software
      Developer's Manual Volume 3B: System Programming Guide"

   As can be seen, these names are not always easy to correlate to a
specific event of interest.  The processor manual should provide more
clarity on this.
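
   A raw name from this list is used in the same way as an alias.  As a
hypothetical example, and assuming the event is available on the target
processor, the definition below selects one of the DTLB related events
at the high sampling rate:

     $ gprofng collect app -h dtlb_load_misses.miss_causes_a_walk,hi ...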


File: gprofng.info,  Node: Examples Using Hardware Event Counters,  Prev: Getting Information on the Counters Supported,  Up: Profile Hardware Event Counters

3.4.2 Examples Using Hardware Event Counters
--------------------------------------------

The previous section may give the impression that these counters are
hard to use, but as we will show now, in practice it is quite simple.

   With the information from the '-h' option, we can easily set up our
first event counter experiment.

   We start by using the default set of counters defined for our
processor and we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2

   The new option here is '-h auto'.  The 'auto' keyword enables
hardware event counter profiling and selects the default set of
counters defined for this processor.

   As before, we can display the information, but there is one
practical hurdle to take.  Unless we like to view all metrics recorded,
we would need to know the names of the events that have been enabled.
This is tedious and also not portable in case we would like to repeat
this experiment on another processor.

   This is where the special 'hwc' metric comes in very handy.  It
automatically expands to the active set of events used.

   With this, it is very easy to display the event counter values.
Note that although the regular clock based profiling was enabled, we
only want to see the counter values.  We also request to see the
percentages and limit the output to the first 5 lines:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp

     Current metrics: e.%cycles:e+%insts:e+%llm:name
     Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
     Print limit set to 5
     Functions sorted by metric: Exclusive CPU Cycles

     Excl. CPU     Excl. Instructions   Excl. Last-Level     Name
     Cycles        Executed             Cache Misses
      sec.      %                  %                  %
     2.691 100.00  7906475309 100.00    122658983 100.00    <Total>
     2.598  96.54  7432724378  94.01    121745696  99.26    mxv_core
     0.035   1.31   188860269   2.39        70084   0.06    erand48_r
     0.026   0.95    73623396   0.93       763116   0.62    init_data
     0.018   0.66    76824434   0.97        40040   0.03    drand48

   As we have seen before, the first few lines echo the settings.  This
includes a list with the hardware event counters used by default.

   The table that follows makes it very easy to get an overview where
the time is spent and how many of the target events have occurred.

   As before, we can drill down deeper and see the same metrics at the
source line and instruction level.  Other than using 'hwc' in the
metrics definitions, nothing has changed compared to the previous
examples:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.hwc -source mxv_core $exp

   This is the relevant part of the output.  Since the lines get very
long, we have somewhat modified the lay-out:

        Excl. CPU    Excl.         Excl.
        Cycles       Instructions  Last-Level
         sec.        Executed      Cache Misses
                                                <Function: mxv_core>
        0.           0             0              32. void __attribute__ ((noinline))
                                                      mxv_core(...)
        0.           0             0              33. {
        0.           0             0              34. for (uint64_t i=...) {
        0.           0             0              35.    double row_sum = 0.0;
     ## 1.872        7291879319    88150571       36.    for (int64_t j=0; j<n; j++)
        0.725        140845059     33595125       37.      row_sum += A[i][j]*b[j];
        0.           0             0              38.    c[i] = row_sum;
                                                  39. }
        0.           0             0              40. }
}
2059*c42dbd0eSchristos
2060*c42dbd0eSchristos In a similar way we can display the event counter values at the
2061*c42dbd0eSchristosinstruction level. Again we have modified the lay-out due to page width
2062*c42dbd0eSchristoslimitations:
2063*c42dbd0eSchristos
2064*c42dbd0eSchristos $ exp=mxv.hwc.def.2.thr.er
2065*c42dbd0eSchristos $ gprofng display text -metrics e.hwc -disasm mxv_core $exp
2066*c42dbd0eSchristos
2067*c42dbd0eSchristos Excl. CPU Excl. Excl.
2068*c42dbd0eSchristos Cycles Instructions Last-Level
2069*c42dbd0eSchristos sec. Executed Cache Misses
2070*c42dbd0eSchristos <Function: mxv_core>
2071*c42dbd0eSchristos 0. 0 0 [33] 4021ba: mov 0x8(%rsp),%r10
2072*c42dbd0eSchristos 34. for (uint64_t i=...) {
2073*c42dbd0eSchristos 0. 0 0 [34] 4021bf: cmp %rsi,%rdi
2074*c42dbd0eSchristos 0. 0 0 [34] 4021c2: jbe 0x37
2075*c42dbd0eSchristos 0. 0 0 [34] 4021c4: ret
2076*c42dbd0eSchristos 35. double row_sum = 0.0;
2077*c42dbd0eSchristos 36. for (int64_t j=0; j<n; j++)
2078*c42dbd0eSchristos 37. row_sum += A[i][j]*b[j];
2079*c42dbd0eSchristos 0. 0 0 [37] 4021c5: mov (%r8,%rdi,8),%rdx
2080*c42dbd0eSchristos 0. 0 0 [36] 4021c9: mov $0x0,%eax
2081*c42dbd0eSchristos 0. 0 0 [35] 4021ce: pxor %xmm1,%xmm1
2082*c42dbd0eSchristos 0.002 12804230 321394 [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
2083*c42dbd0eSchristos 0.141 60819025 3866677 [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
2084*c42dbd0eSchristos 0.582 67221804 29407054 [37] 4021dd: addsd %xmm0,%xmm1
2085*c42dbd0eSchristos ## 1.871 7279075109 87989870 [36] 4021e1: add $0x1,%rax
2086*c42dbd0eSchristos 0.002 12804210 80351 [36] 4021e5: cmp %rax,%rcx
2087*c42dbd0eSchristos 0. 0 0 [36] 4021e8: jne 0xffffffffffffffea
2088*c42dbd0eSchristos 38. c[i] = row_sum;
2089*c42dbd0eSchristos 0. 0 0 [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
2090*c42dbd0eSchristos 0. 0 0 [34] 4021f0: add $0x1,%rdi
2091*c42dbd0eSchristos 0. 0 0 [34] 4021f4: cmp %rdi,%rsi
2092*c42dbd0eSchristos 0. 0 0 [34] 4021f7: jb 0xd
2093*c42dbd0eSchristos 0. 0 0 [35] 4021f9: pxor %xmm1,%xmm1
2094*c42dbd0eSchristos 0. 0 0 [36] 4021fd: test %rcx,%rcx
2095*c42dbd0eSchristos 0. 0 80350 [36] 402200: jne 0xffffffffffffffc5
2096*c42dbd0eSchristos 0. 0 0 [36] 402202: jmp 0xffffffffffffffe8
2097*c42dbd0eSchristos 39. }
2098*c42dbd0eSchristos 40. }
2099*c42dbd0eSchristos 0. 0 0 [40] 402204: ret
2100*c42dbd0eSchristos
2101*c42dbd0eSchristos So far we have used the default settings for the event counters. It
2102*c42dbd0eSchristosis quite straightforward to select specific counters. For the sake of
2103*c42dbd0eSchristosthe example, let's assume we would like to count how many branch
2104*c42dbd0eSchristosinstructions have been executed, as well as how many retired memory
2105*c42dbd0eSchristosload instructions missed in the L1 cache. We also want to count these
2106*c42dbd0eSchristosevents with a high resolution.
2107*c42dbd0eSchristos
2108*c42dbd0eSchristos This is the command to do so:
2109*c42dbd0eSchristos
2110*c42dbd0eSchristos $ exe=mxv-pthreads.exe
2111*c42dbd0eSchristos $ m=3000
2112*c42dbd0eSchristos $ n=2000
2113*c42dbd0eSchristos $ exp=mxv.hwc.sel.2.thr.er
2114*c42dbd0eSchristos $ hwc1=br_ins,hi
2115*c42dbd0eSchristos $ hwc2=mem_load_retired.l1_miss,hi
2116*c42dbd0eSchristos $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
2117*c42dbd0eSchristos
2118*c42dbd0eSchristos As before, we get a table with the event counts. Due to the very
2119*c42dbd0eSchristoslong name for the second counter, we have somewhat modified the output.
2120*c42dbd0eSchristos 2121*c42dbd0eSchristos $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er 2122*c42dbd0eSchristos 2123*c42dbd0eSchristos Functions sorted by metric: Exclusive Total CPU Time 2124*c42dbd0eSchristos Excl. Incl. Excl. Branch Excl. Name 2125*c42dbd0eSchristos Total Total Instructions mem_load_retired.l1_miss 2126*c42dbd0eSchristos CPU sec. CPU sec. Events 2127*c42dbd0eSchristos 2.597 2.597 1305305319 4021340 <Total> 2128*c42dbd0eSchristos 2.481 2.481 1233233242 3982327 mxv_core 2129*c42dbd0eSchristos 0.040 0.107 19019012 9003 init_data 2130*c42dbd0eSchristos 0.028 0.052 23023048 15006 erand48_r 2131*c42dbd0eSchristos 0.024 0.024 19019008 9004 __drand48_iterate 2132*c42dbd0eSchristos 0.015 0.067 11011009 2998 drand48 2133*c42dbd0eSchristos 0.008 0.010 0 3002 _int_malloc 2134*c42dbd0eSchristos 0.001 0.001 0 0 brk 2135*c42dbd0eSchristos 0.001 0.002 0 0 sysmalloc 2136*c42dbd0eSchristos 0. 0.001 0 0 __default_morecore 2137*c42dbd0eSchristos 2138*c42dbd0eSchristos When using event counters, the values could be very large and it is 2139*c42dbd0eSchristosnot easy to compare the numbers. As we will show next, the 'ratio' 2140*c42dbd0eSchristosfeature is very useful when comparing such profiles. 2141*c42dbd0eSchristos 2142*c42dbd0eSchristos To demonstrate this, we have set up another event counter experiment 2143*c42dbd0eSchristoswhere we would like to compare the number of last level cache miss and 2144*c42dbd0eSchristosthe number of branch instructions executed when using a single thread, 2145*c42dbd0eSchristosor two threads. 2146*c42dbd0eSchristos 2147*c42dbd0eSchristos These are the commands used to generate the experiment directories: 2148*c42dbd0eSchristos 2149*c42dbd0eSchristos $ exe=./mxv-pthreads.exe 2150*c42dbd0eSchristos $ m=3000 2151*c42dbd0eSchristos $ n=2000 2152*c42dbd0eSchristos $ exp1=mxv.hwc.comp.1.thr.er 2153*c42dbd0eSchristos $ exp2=mxv.hwc.comp.2.thr.er 2154*c42dbd0eSchristos $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1 2155*c42dbd0eSchristos $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2 2156*c42dbd0eSchristos 2157*c42dbd0eSchristos The following script has been used to get the tables. Due to lay-out 2158*c42dbd0eSchristosrestrictions, we have to create two tables, one for each counter. 2159*c42dbd0eSchristos 2160*c42dbd0eSchristos # Limit the output to 5 lines 2161*c42dbd0eSchristos limit 5 2162*c42dbd0eSchristos # Define the metrics 2163*c42dbd0eSchristos metrics name:e.llm 2164*c42dbd0eSchristos # Set the comparison to ratio 2165*c42dbd0eSchristos compare ratio 2166*c42dbd0eSchristos functions 2167*c42dbd0eSchristos # Define the metrics 2168*c42dbd0eSchristos metrics name:e.br_ins 2169*c42dbd0eSchristos # Set the comparison to ratio 2170*c42dbd0eSchristos compare ratio 2171*c42dbd0eSchristos functions 2172*c42dbd0eSchristos 2173*c42dbd0eSchristos Note that we print the name of the function first, followed by the 2174*c42dbd0eSchristoscounter data. The new element is that we set the comparison mode to 2175*c42dbd0eSchristos'ratio'. This divides the data in a column by its counterpart in the 2176*c42dbd0eSchristosreference experiment. 
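   A brief aside before running it: the comparison is not limited to
ratios.  Assuming the 'delta' keyword behaves as described in the
section on comparing experiments, the first half of the script could be
changed along the following lines to show absolute differences instead
of ratios:

     # Limit the output to 5 lines
     limit 5
     # Define the metrics
     metrics name:e.llm
     # Set the comparison to absolute differences
     compare delta
     functions

   The examples that follow continue to use the 'ratio' version of the
script.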
2177*c42dbd0eSchristos
2178*c42dbd0eSchristos This is the command using this script and the two experiment
2179*c42dbd0eSchristosdirectories as input:
2180*c42dbd0eSchristos
2181*c42dbd0eSchristos $ gprofng display text -script my-script-comp-counters \
2182*c42dbd0eSchristos mxv.hwc.comp.1.thr.er \
2183*c42dbd0eSchristos mxv.hwc.comp.2.thr.er
2184*c42dbd0eSchristos
2185*c42dbd0eSchristos By design, we get two tables, one for each counter:
2186*c42dbd0eSchristos
2187*c42dbd0eSchristos Functions sorted by metric: Exclusive Last-Level Cache Misses
2188*c42dbd0eSchristos
2189*c42dbd0eSchristos mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
2190*c42dbd0eSchristos Name Excl. Last-Level Excl. Last-Level
2191*c42dbd0eSchristos Cache Misses Cache Misses
2192*c42dbd0eSchristos ratio
2193*c42dbd0eSchristos <Total> 122709276 x 0.788
2194*c42dbd0eSchristos mxv_core 121796001 x 0.787
2195*c42dbd0eSchristos init_data 723064 x 1.055
2196*c42dbd0eSchristos erand48_r 100111 x 0.500
2197*c42dbd0eSchristos drand48 60065 x 1.167
2198*c42dbd0eSchristos
2199*c42dbd0eSchristos Functions sorted by metric: Exclusive Branch Instructions
2200*c42dbd0eSchristos
2201*c42dbd0eSchristos mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
2202*c42dbd0eSchristos Name Excl. Branch Excl. Branch
2203*c42dbd0eSchristos Instructions Instructions
2204*c42dbd0eSchristos ratio
2205*c42dbd0eSchristos <Total> 1307307316 x 0.997
2206*c42dbd0eSchristos mxv_core 1235235239 x 0.997
2207*c42dbd0eSchristos erand48_r 23023033 x 0.957
2208*c42dbd0eSchristos drand48 20020009 x 0.600
2209*c42dbd0eSchristos __drand48_iterate 17017028 x 0.882
2210*c42dbd0eSchristos
2211*c42dbd0eSchristos A ratio less than one in the second column means that this counter
2212*c42dbd0eSchristosvalue was smaller than the value from the reference experiment shown in
2213*c42dbd0eSchristosthe first column.
2214*c42dbd0eSchristos
2215*c42dbd0eSchristos This kind of presentation of the results makes it much easier to
2216*c42dbd0eSchristosquickly interpret the data.
2217*c42dbd0eSchristos
2218*c42dbd0eSchristos We conclude this section with thread-level event counter overviews,
2219*c42dbd0eSchristosbut before we go into this, there is an important metric we need to
2220*c42dbd0eSchristosmention.
2221*c42dbd0eSchristos
2222*c42dbd0eSchristos In case it is known how many instructions and CPU cycles have been
2223*c42dbd0eSchristosexecuted, the value for the IPC ("Instructions Per Clockcycle") can be
2224*c42dbd0eSchristoscomputed. *Note Hardware Event Counters Explained::. This is a derived
2225*c42dbd0eSchristosmetric that gives an indication of how well the processor is utilized.
2226*c42dbd0eSchristosThe inverse of the IPC is called CPI.
2227*c42dbd0eSchristos
2228*c42dbd0eSchristos The 'gprofng display text' command automatically computes the IPC and
2229*c42dbd0eSchristosCPI values if an experiment contains the event counter values for the
2230*c42dbd0eSchristosinstructions and CPU cycles executed. These are part of the metric list
2231*c42dbd0eSchristosand can be displayed, just like any other metric.
2232*c42dbd0eSchristos
2233*c42dbd0eSchristos This can be verified through the 'metric_list' command. If we go
2234*c42dbd0eSchristosback to our earlier experiment with the default event counters, we get
2235*c42dbd0eSchristosthe following result.
2236*c42dbd0eSchristos
2237*c42dbd0eSchristos $ gprofng display text -metric_list mxv.hwc.def.2.thr.er
2238*c42dbd0eSchristos
2239*c42dbd0eSchristos Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
2240*c42dbd0eSchristos Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
2241*c42dbd0eSchristos Available metrics:
2242*c42dbd0eSchristos Exclusive Total CPU Time: e.%totalcpu
2243*c42dbd0eSchristos Inclusive Total CPU Time: i.%totalcpu
2244*c42dbd0eSchristos Exclusive CPU Cycles: e.+%cycles
2245*c42dbd0eSchristos Inclusive CPU Cycles: i.+%cycles
2246*c42dbd0eSchristos Exclusive Instructions Executed: e+%insts
2247*c42dbd0eSchristos Inclusive Instructions Executed: i+%insts
2248*c42dbd0eSchristos Exclusive Last-Level Cache Misses: e+%llm
2249*c42dbd0eSchristos Inclusive Last-Level Cache Misses: i+%llm
2250*c42dbd0eSchristos Exclusive Instructions Per Cycle: e+IPC
2251*c42dbd0eSchristos Inclusive Instructions Per Cycle: i+IPC
2252*c42dbd0eSchristos Exclusive Cycles Per Instruction: e+CPI
2253*c42dbd0eSchristos Inclusive Cycles Per Instruction: i+CPI
2254*c42dbd0eSchristos Size: size
2255*c42dbd0eSchristos PC Address: address
2256*c42dbd0eSchristos Name: name
2257*c42dbd0eSchristos
2258*c42dbd0eSchristos Among the other metrics, we see the new metrics for the IPC and CPI
2259*c42dbd0eSchristoslisted.
2260*c42dbd0eSchristos
2261*c42dbd0eSchristos In the script below, we use this information and add the IPC and CPI
2262*c42dbd0eSchristosto the metrics to be displayed. We also use the thread filter to
2263*c42dbd0eSchristosdisplay these values for the individual threads.
2264*c42dbd0eSchristos
2265*c42dbd0eSchristos This is the complete script we have used. Other than a different
2266*c42dbd0eSchristosselection of the metrics, there are no new features.
2267*c42dbd0eSchristos
2268*c42dbd0eSchristos # Define the metrics
2269*c42dbd0eSchristos metrics e.insts:e.%cycles:e.IPC:e.CPI
2270*c42dbd0eSchristos # Sort with respect to cycles
2271*c42dbd0eSchristos sort e.cycles
2272*c42dbd0eSchristos # Limit the output to 5 lines
2273*c42dbd0eSchristos limit 5
2274*c42dbd0eSchristos # Get the function overview for all threads
2275*c42dbd0eSchristos functions
2276*c42dbd0eSchristos # Get the function overview for thread 1
2277*c42dbd0eSchristos thread_select 1
2278*c42dbd0eSchristos functions
2279*c42dbd0eSchristos # Get the function overview for thread 2
2280*c42dbd0eSchristos thread_select 2
2281*c42dbd0eSchristos functions
2282*c42dbd0eSchristos # Get the function overview for thread 3
2283*c42dbd0eSchristos thread_select 3
2284*c42dbd0eSchristos functions
2285*c42dbd0eSchristos
2286*c42dbd0eSchristos In the metrics definition on the second line, we explicitly request
2287*c42dbd0eSchristosthe counter values for the instructions ('e.insts') and CPU cycles
2288*c42dbd0eSchristos('e.cycles') executed. These names can be found in the output from the
2289*c42dbd0eSchristos'metric_list' command above. In addition to these metrics, we also
2290*c42dbd0eSchristosrequest the IPC and CPI to be shown.
2291*c42dbd0eSchristos
2292*c42dbd0eSchristos As before, we used the 'limit' command to control the number of
2293*c42dbd0eSchristosfunctions displayed. We then request an overview for all the threads,
2294*c42dbd0eSchristosfollowed by three sets of two commands to select a thread and display
2295*c42dbd0eSchristosthe function overview.
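   As an aside, if only the aggregated overview across all threads is
needed, a script is not strictly necessary.  Assuming that, as with the
'-metrics' and '-limit' options used earlier, these metric names are
also accepted directly on the command line, a single command along the
following lines should produce a comparable table.  Note that it keeps
the default sort order, since the 'sort' command is not part of it:

     $ gprofng display text -metrics e.insts:e.%cycles:e.IPC:e.CPI \
         -limit 5 -functions mxv.hwc.def.2.thr.er

   The per-thread views, however, require the thread selection and are
easiest to produce with the script.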
2296*c42dbd0eSchristos 2297*c42dbd0eSchristos The script above is used as follows: 2298*c42dbd0eSchristos 2299*c42dbd0eSchristos $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er 2300*c42dbd0eSchristos 2301*c42dbd0eSchristos This script produces four tables. We list them separately below, and 2302*c42dbd0eSchristoshave left out the additional output. 2303*c42dbd0eSchristos 2304*c42dbd0eSchristos The first table shows the accumulated values across the three threads 2305*c42dbd0eSchristosthat have been active. 2306*c42dbd0eSchristos 2307*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2308*c42dbd0eSchristos 2309*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2310*c42dbd0eSchristos Instructions Cycles IPC CPI 2311*c42dbd0eSchristos Executed sec. % 2312*c42dbd0eSchristos 7906475309 2.691 100.00 1.473 0.679 <Total> 2313*c42dbd0eSchristos 7432724378 2.598 96.54 1.434 0.697 mxv_core 2314*c42dbd0eSchristos 188860269 0.035 1.31 2.682 0.373 erand48_r 2315*c42dbd0eSchristos 73623396 0.026 0.95 1.438 0.696 init_data 2316*c42dbd0eSchristos 76824434 0.018 0.66 2.182 0.458 drand48 2317*c42dbd0eSchristos 2318*c42dbd0eSchristos This shows that IPC of this program is completely dominated by 2319*c42dbd0eSchristosfunction 'mxv_core'. It has a fairly low IPC value of 1.43. 2320*c42dbd0eSchristos 2321*c42dbd0eSchristos The next table is for thread 1 and shows the values for the main 2322*c42dbd0eSchristosthread. 2323*c42dbd0eSchristos 2324*c42dbd0eSchristos Exp Sel Total 2325*c42dbd0eSchristos === === ===== 2326*c42dbd0eSchristos 1 1 3 2327*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2328*c42dbd0eSchristos 2329*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2330*c42dbd0eSchristos Instructions Cycles IPC CPI 2331*c42dbd0eSchristos Executed sec. % 2332*c42dbd0eSchristos 473750931 0.093 100.00 2.552 0.392 <Total> 2333*c42dbd0eSchristos 188860269 0.035 37.93 2.682 0.373 erand48_r 2334*c42dbd0eSchristos 73623396 0.026 27.59 1.438 0.696 init_data 2335*c42dbd0eSchristos 76824434 0.018 18.97 2.182 0.458 drand48 2336*c42dbd0eSchristos 134442832 0.013 13.79 5.250 0.190 __drand48_iterate 2337*c42dbd0eSchristos 2338*c42dbd0eSchristos Although this thread hardly uses any CPU cycles, the overall IPC of 2339*c42dbd0eSchristos2.55 is not all that bad. 2340*c42dbd0eSchristos 2341*c42dbd0eSchristos Last, we show the tables for threads 2 and 3: 2342*c42dbd0eSchristos 2343*c42dbd0eSchristos Exp Sel Total 2344*c42dbd0eSchristos === === ===== 2345*c42dbd0eSchristos 1 2 3 2346*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2347*c42dbd0eSchristos 2348*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2349*c42dbd0eSchristos Instructions Cycles IPC CPI 2350*c42dbd0eSchristos Executed sec. % 2351*c42dbd0eSchristos 3716362189 1.298 100.00 1.435 0.697 <Total> 2352*c42dbd0eSchristos 3716362189 1.298 100.00 1.435 0.697 mxv_core 2353*c42dbd0eSchristos 0 0. 0. 0. 0. collector_root 2354*c42dbd0eSchristos 0 0. 0. 0. 0. driver_mxv 2355*c42dbd0eSchristos 2356*c42dbd0eSchristos Exp Sel Total 2357*c42dbd0eSchristos === === ===== 2358*c42dbd0eSchristos 1 3 3 2359*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2360*c42dbd0eSchristos 2361*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2362*c42dbd0eSchristos Instructions Cycles IPC CPI 2363*c42dbd0eSchristos Executed sec. % 2364*c42dbd0eSchristos 3716362189 1.300 100.00 1.433 0.698 <Total> 2365*c42dbd0eSchristos 3716362189 1.300 100.00 1.433 0.698 mxv_core 2366*c42dbd0eSchristos 0 0. 0. 0. 0. 
collector_root 2367*c42dbd0eSchristos 0 0. 0. 0. 0. driver_mxv 2368*c42dbd0eSchristos 2369*c42dbd0eSchristos It is seen that both execute the same number of instructions and take 2370*c42dbd0eSchristosabout the same number of CPU cycles. As a result, the IPC is the same 2371*c42dbd0eSchristosfor both threads. 2372*c42dbd0eSchristos 2373*c42dbd0eSchristos 2374*c42dbd0eSchristosFile: gprofng.info, Node: Java Profiling, Prev: Profile Hardware Event Counters, Up: A Mini Tutorial 2375*c42dbd0eSchristos 2376*c42dbd0eSchristos3.5 Java Profiling 2377*c42dbd0eSchristos================== 2378*c42dbd0eSchristos 2379*c42dbd0eSchristosThe 'gprofng collect app' command supports Java profiling. The '-j on' 2380*c42dbd0eSchristosoption can be used for this, but since this feature is enabled by 2381*c42dbd0eSchristosdefault, there is no need to set this explicitly. Java profiling may be 2382*c42dbd0eSchristosdisabled through the '-j off' option. 2383*c42dbd0eSchristos 2384*c42dbd0eSchristos The program is compiled as usual and the experiment directory is 2385*c42dbd0eSchristoscreated similar to what we have seen before. The only difference with a 2386*c42dbd0eSchristosC/C++ application is that the program has to be explicitly executed by 2387*c42dbd0eSchristosjava. 2388*c42dbd0eSchristos 2389*c42dbd0eSchristos For example, this is how to generate the experiment data for a Java 2390*c42dbd0eSchristosprogram that has the source code stored in file 'Pi.java': 2391*c42dbd0eSchristos 2392*c42dbd0eSchristos $ javac Pi.java 2393*c42dbd0eSchristos $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in 2394*c42dbd0eSchristos 2395*c42dbd0eSchristos Regarding which java is selected to generate the data, 'gprofng' 2396*c42dbd0eSchristosfirst looks for the JDK in the path set in either the 'JDK_HOME' 2397*c42dbd0eSchristosenvironment variable, or in the 'JAVA_PATH' environment variable. If 2398*c42dbd0eSchristosneither of these variables is set, it checks for a JDK in the search 2399*c42dbd0eSchristospath (set in the PATH environment variable). If there is no JDK in this 2400*c42dbd0eSchristospath, it checks for the java executable in '/usr/java/bin/java'. 2401*c42dbd0eSchristos 2402*c42dbd0eSchristos In case additional options need to be passed on to the JVM, the '-J 2403*c42dbd0eSchristos<string>' option can be used. The string with the option(s) has to be 2404*c42dbd0eSchristosdelimited by quotation marks in case there is more than one argument. 2405*c42dbd0eSchristos 2406*c42dbd0eSchristos The 'gprofng display text' command may be used to view the 2407*c42dbd0eSchristosperformance data. There is no need for any special options and the same 2408*c42dbd0eSchristoscommands as previously discussed are supported. 2409*c42dbd0eSchristos 2410*c42dbd0eSchristos The 'viewmode' command *Note The Viewmode:: is very useful to examine 2411*c42dbd0eSchristosthe call stacks. 2412*c42dbd0eSchristos 2413*c42dbd0eSchristos For example, this is how one can see the native call stacks. For 2414*c42dbd0eSchristoslay-out purposes we have restricted the list to the first five entries: 2415*c42dbd0eSchristos 2416*c42dbd0eSchristos $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er 2417*c42dbd0eSchristos 2418*c42dbd0eSchristos Print limit set to 5 2419*c42dbd0eSchristos Viewmode set to machine 2420*c42dbd0eSchristos Functions Call Tree. Metric: Attributed Total CPU Time 2421*c42dbd0eSchristos 2422*c42dbd0eSchristos Attr. Name 2423*c42dbd0eSchristos Total 2424*c42dbd0eSchristos CPU sec. 
2425*c42dbd0eSchristos 1.381 +-<Total> 2426*c42dbd0eSchristos 1.171 +-Pi.calculatePi(double) 2427*c42dbd0eSchristos 0.110 +-collector_root 2428*c42dbd0eSchristos 0.110 | +-JavaMain 2429*c42dbd0eSchristos 0.070 | +-jni_CallStaticVoidMethod 2430*c42dbd0eSchristos 2431*c42dbd0eSchristosNote that the selection of the viewmode is echoed in the output. 2432*c42dbd0eSchristos 2433*c42dbd0eSchristos 2434*c42dbd0eSchristosFile: gprofng.info, Node: Terminology, Next: Other Document Formats, Prev: A Mini Tutorial, Up: Top 2435*c42dbd0eSchristos 2436*c42dbd0eSchristos4 Terminology 2437*c42dbd0eSchristos************* 2438*c42dbd0eSchristos 2439*c42dbd0eSchristosThroughout this manual, certain terminology specific to profiling tools, 2440*c42dbd0eSchristosor 'gprofng', or even to this document only, is used. In this chapter 2441*c42dbd0eSchristoswe explain this terminology in detail. 2442*c42dbd0eSchristos 2443*c42dbd0eSchristos* Menu: 2444*c42dbd0eSchristos 2445*c42dbd0eSchristos* The Program Counter:: What is a Program Counter? 2446*c42dbd0eSchristos* Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics. 2447*c42dbd0eSchristos* Metric Definitions:: Definitions associated with metrics. 2448*c42dbd0eSchristos* The Viewmode:: Select the way call stacks are presented. 2449*c42dbd0eSchristos* The Selection List:: How to define a selection. 2450*c42dbd0eSchristos* Load Objects and Functions:: The components in an application. 2451*c42dbd0eSchristos* The Concept of a CPU in gprofng:: The definition of a CPU. 2452*c42dbd0eSchristos* Hardware Event Counters Explained:: What are event counters? 2453*c42dbd0eSchristos* apath:: Our generic definition of a path. 2454*c42dbd0eSchristos 2455*c42dbd0eSchristos 2456*c42dbd0eSchristosFile: gprofng.info, Node: The Program Counter, Next: Inclusive and Exclusive Metrics, Up: Terminology 2457*c42dbd0eSchristos 2458*c42dbd0eSchristos4.1 The Program Counter 2459*c42dbd0eSchristos======================= 2460*c42dbd0eSchristos 2461*c42dbd0eSchristosThe _Program Counter_, or PC for short, keeps track where program 2462*c42dbd0eSchristosexecution is. The address of the next instruction to be executed is 2463*c42dbd0eSchristosstored in a special purpose register in the processor, or core. 2464*c42dbd0eSchristos 2465*c42dbd0eSchristos The PC is sometimes also referred to as the _instruction pointer_, 2466*c42dbd0eSchristosbut we will use Program Counter or PC throughout this document. 2467*c42dbd0eSchristos 2468*c42dbd0eSchristos 2469*c42dbd0eSchristosFile: gprofng.info, Node: Inclusive and Exclusive Metrics, Next: Metric Definitions, Prev: The Program Counter, Up: Terminology 2470*c42dbd0eSchristos 2471*c42dbd0eSchristos4.2 Inclusive and Exclusive Metrics 2472*c42dbd0eSchristos=================================== 2473*c42dbd0eSchristos 2474*c42dbd0eSchristosIn the remainder, these two concepts occur quite often and for lack of a 2475*c42dbd0eSchristosbetter place, they are explained here. 2476*c42dbd0eSchristos 2477*c42dbd0eSchristos The _inclusive_ value for a metric includes all values that are part 2478*c42dbd0eSchristosof the dynamic extent of the target function. For example if function 2479*c42dbd0eSchristos'A' calls functions 'B' and 'C', the inclusive CPU time for 'A' includes 2480*c42dbd0eSchristosthe CPU time spent in 'B' and 'C'. 2481*c42dbd0eSchristos 2482*c42dbd0eSchristos In contrast with this, the _exclusive_ value for a metric is computed 2483*c42dbd0eSchristosby excluding the metric values used by other functions called. 
In our
2484*c42dbd0eSchristosimaginary example, the exclusive CPU time for function 'A' is the time
2485*c42dbd0eSchristosspent outside calling functions 'B' and 'C'.
2486*c42dbd0eSchristos
2487*c42dbd0eSchristos In case of a _leaf function_, the inclusive and exclusive values for
2488*c42dbd0eSchristosthe metric are the same since, by definition, it is not calling any other
2489*c42dbd0eSchristosfunction(s).
2490*c42dbd0eSchristos
2491*c42dbd0eSchristos Why do we use these two different values? The inclusive metric shows
2492*c42dbd0eSchristosthe most expensive path, in terms of this metric, in the application.
2493*c42dbd0eSchristosFor example, if the metric is cache misses, the function with the
2494*c42dbd0eSchristoshighest inclusive metric tells you where most of the cache misses come
2495*c42dbd0eSchristosfrom.
2496*c42dbd0eSchristos
2497*c42dbd0eSchristos Within this branch of the application, the exclusive metric points to
2498*c42dbd0eSchristosthe functions that contribute, and helps to identify which part(s) to
2499*c42dbd0eSchristosconsider for further analysis.
2500*c42dbd0eSchristos
2501*c42dbd0eSchristos
2502*c42dbd0eSchristosFile: gprofng.info, Node: Metric Definitions, Next: The Viewmode, Prev: Inclusive and Exclusive Metrics, Up: Terminology
2503*c42dbd0eSchristos
2504*c42dbd0eSchristos4.3 Metric Definitions
2505*c42dbd0eSchristos======================
2506*c42dbd0eSchristos
2507*c42dbd0eSchristosThe metrics to be shown are highly customizable. In this section we
2508*c42dbd0eSchristosexplain the definitions associated with metrics.
2509*c42dbd0eSchristos
2510*c42dbd0eSchristos The 'metrics' command takes a colon (:) separated list of special
2511*c42dbd0eSchristoskeywords. Each keyword consists of the following three fields:
2512*c42dbd0eSchristos'<flavor>''<visibility>''<metric_name>'.
2513*c42dbd0eSchristos
2514*c42dbd0eSchristos The _<flavor>_ field is either an 'e' for "exclusive", or 'i' for
2515*c42dbd0eSchristos"inclusive". The '<metric_name>' field is the name of the metric
2516*c42dbd0eSchristosrequest. The _<visibility>_ field consists of one or more characters
2517*c42dbd0eSchristosfrom the following table:
2518*c42dbd0eSchristos
2519*c42dbd0eSchristos'.'
2520*c42dbd0eSchristos Show the metric as time. This applies to timing metrics and
2521*c42dbd0eSchristos hardware event counters that measure cycles. Interpret as '+' for
2522*c42dbd0eSchristos other metrics.
2523*c42dbd0eSchristos
2524*c42dbd0eSchristos'%'
2525*c42dbd0eSchristos Show the metric as a percentage of the total value for this metric.
2526*c42dbd0eSchristos
2527*c42dbd0eSchristos'+'
2528*c42dbd0eSchristos Show the metric as an absolute value. For hardware event counters
2529*c42dbd0eSchristos this is the event count. Interpret as '.' for timing metrics.
2530*c42dbd0eSchristos
2531*c42dbd0eSchristos'|'
2532*c42dbd0eSchristos Do not show any metric value. Cannot be used with other visibility
2533*c42dbd0eSchristos characters.
 For example, the definition 'e.%cycles', which appeared in the metric overviews earlier in this manual, requests the exclusive CPU cycles, displayed both as time and as a percentage of the total.
2534*c42dbd0eSchristos
2535*c42dbd0eSchristos
2536*c42dbd0eSchristosFile: gprofng.info, Node: The Viewmode, Next: The Selection List, Prev: Metric Definitions, Up: Terminology
2537*c42dbd0eSchristos
2538*c42dbd0eSchristos4.4 The Viewmode
2539*c42dbd0eSchristos================
2540*c42dbd0eSchristos
2541*c42dbd0eSchristosThere are different ways to view a call stack in Java. In 'gprofng',
2542*c42dbd0eSchristosthis is called the _viewmode_ and the setting is controlled through a
2543*c42dbd0eSchristoscommand with the same name.
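   As a quick illustration, and assuming the Java experiment directory
'pi.demo.er' from the previous chapter is still available, the viewmode
can be selected directly on the command line, just as in the call tree
example shown there.  The 'expert' keyword used here is one of the
settings described below:

     $ gprofng display text -limit 5 -viewmode expert -calltree pi.demo.er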
2544*c42dbd0eSchristos 2545*c42dbd0eSchristos The 'viewmode' command takes one of the following keywords: 2546*c42dbd0eSchristos 2547*c42dbd0eSchristos'user' 2548*c42dbd0eSchristos This is the default and shows the Java call stacks for Java 2549*c42dbd0eSchristos threads. No call stacks for any housekeeping threads are shown. 2550*c42dbd0eSchristos The function list contains a function '<JVM-System>' that 2551*c42dbd0eSchristos represents the aggregated time from non-Java threads. When the JVM 2552*c42dbd0eSchristos software does not report a Java call stack, time is reported 2553*c42dbd0eSchristos against the function '<no Java callstack recorded>'. 2554*c42dbd0eSchristos 2555*c42dbd0eSchristos'expert' 2556*c42dbd0eSchristos Show the Java call stacks for Java threads when the Java code from 2557*c42dbd0eSchristos the user is executed and machine call stacks when JVM code is 2558*c42dbd0eSchristos executed, or when the JVM software does not report a Java call 2559*c42dbd0eSchristos stack. Show the machine call stacks for housekeeping threads. 2560*c42dbd0eSchristos 2561*c42dbd0eSchristos'machine' 2562*c42dbd0eSchristos Show the actual native call stacks for all threads. 2563*c42dbd0eSchristos 2564*c42dbd0eSchristos 2565*c42dbd0eSchristosFile: gprofng.info, Node: The Selection List, Next: Load Objects and Functions, Prev: The Viewmode, Up: Terminology 2566*c42dbd0eSchristos 2567*c42dbd0eSchristos4.5 The Selection List 2568*c42dbd0eSchristos====================== 2569*c42dbd0eSchristos 2570*c42dbd0eSchristosSeveral commands allow the user to specify a subset of a list. For 2571*c42dbd0eSchristosexample, to select specific threads from all the threads that have been 2572*c42dbd0eSchristosused when conducting the experiment(s). 2573*c42dbd0eSchristos 2574*c42dbd0eSchristos Such a selection list (or "list" in the remainder of this section) 2575*c42dbd0eSchristoscan be a single number, a contiguous range of numbers with the start and 2576*c42dbd0eSchristosend numbers separated by a hyphen ('-'), a comma-separated list of 2577*c42dbd0eSchristosnumbers and ranges, or the 'all' keyword. Lists must not contain 2578*c42dbd0eSchristosspaces. 2579*c42dbd0eSchristos 2580*c42dbd0eSchristos Each list can optionally be preceded by an experiment list with a 2581*c42dbd0eSchristossimilar format, separated from the list by a colon (:). If no 2582*c42dbd0eSchristosexperiment list is included, the list applies to all experiments. 2583*c42dbd0eSchristos 2584*c42dbd0eSchristos Multiple lists can be concatenated by separating the individual lists 2585*c42dbd0eSchristosby a plus sign. 2586*c42dbd0eSchristos 2587*c42dbd0eSchristos These are some examples of various filters using a list: 2588*c42dbd0eSchristos 2589*c42dbd0eSchristos'thread_select 1' 2590*c42dbd0eSchristos Select thread 1 from all experiments. 2591*c42dbd0eSchristos 2592*c42dbd0eSchristos'thread_select all:1' 2593*c42dbd0eSchristos Select thread 1 from all experiments. 2594*c42dbd0eSchristos 2595*c42dbd0eSchristos'thread_select 1:1+2:2' 2596*c42dbd0eSchristos Select thread 1 from experiment 1 and thread 2 from experiment 2. 2597*c42dbd0eSchristos 2598*c42dbd0eSchristos'cpu_select all:1,3,5' 2599*c42dbd0eSchristos Selects cores 1, 3, and 5 from all experiments. 2600*c42dbd0eSchristos 2601*c42dbd0eSchristos'cpu_select 1,2:all' 2602*c42dbd0eSchristos Select all cores from experiments 1 and 2, as listed by the 'by 2603*c42dbd0eSchristos exp_list' command. 
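   To illustrate how these elements can be combined, this is one more,
hypothetical, filter.  The thread and experiment numbers are made up and
merely follow the syntax described above:

'thread_select 1:1-3+2:5,7'
     Select threads 1, 2, and 3 from experiment 1, plus threads 5 and 7
     from experiment 2.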
2604*c42dbd0eSchristos 2605*c42dbd0eSchristos 2606*c42dbd0eSchristosFile: gprofng.info, Node: Load Objects and Functions, Next: The Concept of a CPU in gprofng, Prev: The Selection List, Up: Terminology 2607*c42dbd0eSchristos 2608*c42dbd0eSchristos4.6 Load Objects and Functions 2609*c42dbd0eSchristos============================== 2610*c42dbd0eSchristos 2611*c42dbd0eSchristosAn application consists of various components. The source code files 2612*c42dbd0eSchristosare compiled into object files. These are then glued together at link 2613*c42dbd0eSchristostime to form the executable. During execution, the program may also 2614*c42dbd0eSchristosdynamically load objects. 2615*c42dbd0eSchristos 2616*c42dbd0eSchristos A _load object_ is defined to be an executable, or shared object. A 2617*c42dbd0eSchristosshared library is an example of a load object in 'gprofng'. 2618*c42dbd0eSchristos 2619*c42dbd0eSchristos Each load object, contains a text section with the instructions 2620*c42dbd0eSchristosgenerated by the compiler, a data section for data, and various symbol 2621*c42dbd0eSchristostables. All load objects must contain an ELF symbol table, which gives 2622*c42dbd0eSchristosthe names and addresses of all the globally known functions in that 2623*c42dbd0eSchristosobject. 2624*c42dbd0eSchristos 2625*c42dbd0eSchristos Load objects compiled with the -g option contain additional symbolic 2626*c42dbd0eSchristosinformation that can augment the ELF symbol table and provide 2627*c42dbd0eSchristosinformation about functions that are not global, additional information 2628*c42dbd0eSchristosabout object modules from which the functions came, and line number 2629*c42dbd0eSchristosinformation relating addresses to source lines. 2630*c42dbd0eSchristos 2631*c42dbd0eSchristos The term _function_ is used to describe a set of instructions that 2632*c42dbd0eSchristosrepresent a high-level operation described in the source code. The term 2633*c42dbd0eSchristosalso covers methods as used in C++ and in the Java programming language. 2634*c42dbd0eSchristos 2635*c42dbd0eSchristos In the 'gprofng' context, functions are provided in source code 2636*c42dbd0eSchristosformat. Normally their names appear in the symbol table representing a 2637*c42dbd0eSchristosset of addresses. If the Program Counter (PC) is within that set, the 2638*c42dbd0eSchristosprogram is executing within that function. 2639*c42dbd0eSchristos 2640*c42dbd0eSchristos In principle, any address within the text segment of a load object 2641*c42dbd0eSchristoscan be mapped to a function. Exactly the same mapping is used for the 2642*c42dbd0eSchristosleaf PC and all the other PCs on the call stack. 2643*c42dbd0eSchristos 2644*c42dbd0eSchristos Most of the functions correspond directly to the source model of the 2645*c42dbd0eSchristosprogram, but there are exceptions. This topic is however outside of the 2646*c42dbd0eSchristosscope of this guide. 2647*c42dbd0eSchristos 2648*c42dbd0eSchristos 2649*c42dbd0eSchristosFile: gprofng.info, Node: The Concept of a CPU in gprofng, Next: Hardware Event Counters Explained, Prev: Load Objects and Functions, Up: Terminology 2650*c42dbd0eSchristos 2651*c42dbd0eSchristos4.7 The Concept of a CPU in gprofng 2652*c42dbd0eSchristos=================================== 2653*c42dbd0eSchristos 2654*c42dbd0eSchristosIn gprofng, there is the concept of a CPU. Admittedly, this is not the 2655*c42dbd0eSchristosbest word to describe what is meant here and may be replaced in the 2656*c42dbd0eSchristosfuture. 
2657*c42dbd0eSchristos
2658*c42dbd0eSchristos The word CPU is used in many of the displays. In the context of
2659*c42dbd0eSchristosgprofng, it is meant to denote a part of the processor that is capable
2660*c42dbd0eSchristosof executing instructions and has its own state, like the program
2661*c42dbd0eSchristoscounter.
2662*c42dbd0eSchristos
2663*c42dbd0eSchristos For example, on a contemporary processor, a CPU could be a core. In
2664*c42dbd0eSchristoscase hardware threads are supported within a core, it could be one of
2665*c42dbd0eSchristosthose hardware threads.
2666*c42dbd0eSchristos
2667*c42dbd0eSchristos
2668*c42dbd0eSchristosFile: gprofng.info, Node: Hardware Event Counters Explained, Next: apath, Prev: The Concept of a CPU in gprofng, Up: Terminology
2669*c42dbd0eSchristos
2670*c42dbd0eSchristos4.8 Hardware Event Counters Explained
2671*c42dbd0eSchristos=====================================
2672*c42dbd0eSchristos
2673*c42dbd0eSchristosFor quite a number of years now, many microprocessors have supported
2674*c42dbd0eSchristoshardware event counters.
2675*c42dbd0eSchristos
2676*c42dbd0eSchristos On the hardware side, this means that in the processor there are one
2677*c42dbd0eSchristosor more registers dedicated to count certain activities, or "events".
2678*c42dbd0eSchristosExamples of such events are the number of instructions executed, or the
2679*c42dbd0eSchristosnumber of cache misses at level 2 in the memory hierarchy.
2680*c42dbd0eSchristos
2681*c42dbd0eSchristos While there is a limited set of such registers, the user can map
2682*c42dbd0eSchristosevents onto them. In case more than one register is available, this
2683*c42dbd0eSchristosallows for the simultaneous measurement of various events.
2684*c42dbd0eSchristos
2685*c42dbd0eSchristos A simple, yet powerful, example is to simultaneously count the number
2686*c42dbd0eSchristosof CPU cycles and the number of instructions executed. These two numbers
2687*c42dbd0eSchristoscan then be used to compute the _IPC_ value. IPC stands for
2688*c42dbd0eSchristos"Instructions Per Clockcycle" and each processor has a maximum. For
2689*c42dbd0eSchristosexample, if this maximum number is 2, it means the processor is capable
2690*c42dbd0eSchristosof executing two instructions every clock cycle.
2691*c42dbd0eSchristos
2692*c42dbd0eSchristos Whether this is actually achieved depends on several factors,
2693*c42dbd0eSchristosincluding the instruction characteristics. However, in case the IPC
2694*c42dbd0eSchristosvalue is well below this maximum in a time critical part of the
2695*c42dbd0eSchristosapplication and this cannot be easily explained, further investigation
2696*c42dbd0eSchristosis probably warranted.
2697*c42dbd0eSchristos
2698*c42dbd0eSchristos A related metric is called _CPI_, or "Clockcycles Per Instruction".
2699*c42dbd0eSchristosIt is the inverse of the IPC and can be compared against the theoretical
2700*c42dbd0eSchristosvalue(s) of the target instruction(s). A significant difference may
2701*c42dbd0eSchristospoint at a bottleneck.
2702*c42dbd0eSchristos
2703*c42dbd0eSchristos One thing to keep in mind is that the value returned by a counter can
2704*c42dbd0eSchristoseither be the number of times the event occurred, or a CPU cycle count.
2705*c42dbd0eSchristosIn case of the latter it is possible to convert this number to time.
2706*c42dbd0eSchristos
2707*c42dbd0eSchristos This is often easier to interpret than a simple count, but there is
2708*c42dbd0eSchristosone caveat to keep in mind. The CPU frequency may not have been
2709*c42dbd0eSchristosconstant while the experiment was recorded and this impacts the time
2710*c42dbd0eSchristosreported.
2711*c42dbd0eSchristos
2712*c42dbd0eSchristos These event counters, or "counters" for short, provide great insight
2713*c42dbd0eSchristosinto what happens deep inside the processor. In case higher level
2714*c42dbd0eSchristosinformation does not provide the insight needed, the counters provide
2715*c42dbd0eSchristosthe information to get to the bottom of a performance problem.
2716*c42dbd0eSchristos
2717*c42dbd0eSchristos There are some things to consider though.
2718*c42dbd0eSchristos
2719*c42dbd0eSchristos * The event definitions and names vary across processors and it may
2720*c42dbd0eSchristos even happen that some events change with an update. Unfortunately,
2721*c42dbd0eSchristos and this is luckily rare, there are sometimes bugs causing the
2722*c42dbd0eSchristos wrong count to be returned.
2723*c42dbd0eSchristos
2724*c42dbd0eSchristos In 'gprofng', some of the processor specific event names have an
2725*c42dbd0eSchristos alias name. For example, 'insts' measures the instructions
2726*c42dbd0eSchristos executed. These aliases not only make it easier to identify the
2727*c42dbd0eSchristos functionality, but also provide portability of certain events
2728*c42dbd0eSchristos across processors.
2729*c42dbd0eSchristos
2730*c42dbd0eSchristos * Another complexity is that there are typically many events one can
2731*c42dbd0eSchristos monitor. There may be up to hundreds of events available and it could
2732*c42dbd0eSchristos require several experiments to zoom in on the root cause of a
2733*c42dbd0eSchristos performance problem.
2734*c42dbd0eSchristos
2735*c42dbd0eSchristos * There may be restrictions regarding the mapping of event(s) onto
2736*c42dbd0eSchristos the counters. For example, certain events may be restricted to
2737*c42dbd0eSchristos specific counters only. As a result, one may have to conduct
2738*c42dbd0eSchristos additional experiments to cover all the events of interest.
2739*c42dbd0eSchristos
2740*c42dbd0eSchristos * The names of the events may also not be easy to interpret. In such
2741*c42dbd0eSchristos cases, the description can be found in the architecture manual for
2742*c42dbd0eSchristos the processor.
2743*c42dbd0eSchristos
2744*c42dbd0eSchristos Despite these drawbacks, hardware event counters are extremely useful
2745*c42dbd0eSchristosand may even turn out to be indispensable.
2746*c42dbd0eSchristos
2747*c42dbd0eSchristos
2748*c42dbd0eSchristosFile: gprofng.info, Node: apath, Prev: Hardware Event Counters Explained, Up: Terminology
2749*c42dbd0eSchristos
2750*c42dbd0eSchristos4.9 What is <apath>?
2751*c42dbd0eSchristos====================
2752*c42dbd0eSchristos
2753*c42dbd0eSchristosIn most cases, 'gprofng' shows the absolute pathnames of directories.
2754*c42dbd0eSchristosThese tend to be rather long, causing display issues in this document.
2755*c42dbd0eSchristos
2756*c42dbd0eSchristos Instead of wrapping these long pathnames over multiple lines, we
2757*c42dbd0eSchristosdecided to represent them by the '<apath>' symbol, which stands for "an
2758*c42dbd0eSchristosabsolute pathname".
2759*c42dbd0eSchristos
2760*c42dbd0eSchristos Note that different occurrences of '<apath>' may represent different
2761*c42dbd0eSchristosabsolute pathnames.
2762*c42dbd0eSchristos
2763*c42dbd0eSchristos
2764*c42dbd0eSchristosFile: gprofng.info, Node: Other Document Formats, Next: Index, Prev: Terminology, Up: Top
2765*c42dbd0eSchristos
2766*c42dbd0eSchristos5 Other Document Formats
2767*c42dbd0eSchristos************************
2768*c42dbd0eSchristos
2769*c42dbd0eSchristosThis document is written in Texinfo and the source text is made
2770*c42dbd0eSchristosavailable as part of the binutils distribution. The file is called
2771*c42dbd0eSchristos'gprofng.texi' and can be found in subdirectory 'doc' under directory
2772*c42dbd0eSchristos'gprofng' in the top level directory.
2773*c42dbd0eSchristos
2774*c42dbd0eSchristos This file can be used to generate the document in the 'info', 'html',
2775*c42dbd0eSchristosand 'pdf' formats. The default installation procedure creates a file in
2776*c42dbd0eSchristosthe 'info' format and stores it in the documentation section of
2777*c42dbd0eSchristosbinutils.
2778*c42dbd0eSchristos
2779*c42dbd0eSchristos Probably the easiest way to generate a different format from this
2780*c42dbd0eSchristosTexinfo document is to go to the distribution directory that was created
2781*c42dbd0eSchristoswhen the tools were built. This is either the default distribution
2782*c42dbd0eSchristosdirectory, or the one that has been set with the '--prefix' option as
2783*c42dbd0eSchristospart of the 'configure' command. In this example we symbolize this
2784*c42dbd0eSchristoslocation with '<dist>'.
2785*c42dbd0eSchristos
2786*c42dbd0eSchristos The make file called 'Makefile' in directory '<dist>/gprofng/doc'
2787*c42dbd0eSchristossupports several commands to generate this document in different
2788*c42dbd0eSchristosformats. We recommend using these commands.
2789*c42dbd0eSchristos
2790*c42dbd0eSchristos They create the file(s) and install them in the documentation directory
2791*c42dbd0eSchristosof binutils, which is '<dist>/share/doc' in case 'html' or 'pdf' is
2792*c42dbd0eSchristosselected and '<dist>/share/info' for the file in the 'info' format.
2793*c42dbd0eSchristos
2794*c42dbd0eSchristos To generate this document in the requested format and install it in
2795*c42dbd0eSchristosthe documentation directory, the commands below should be executed. In
2796*c42dbd0eSchristosthis notation, '<format>' is one of 'info', 'html', or 'pdf':
2797*c42dbd0eSchristos
2798*c42dbd0eSchristos $ cd <dist>/gprofng/doc
2799*c42dbd0eSchristos $ make install-<format>
2800*c42dbd0eSchristos
2801*c42dbd0eSchristosSome things to note:
2802*c42dbd0eSchristos
2803*c42dbd0eSchristos * For the 'pdf' file to be generated, the TeX document formatting
2804*c42dbd0eSchristos software is required and the relevant commands need to be included
2805*c42dbd0eSchristos in the search path. An example of a popular TeX implementation is
2806*c42dbd0eSchristos _TexLive_. It is beyond the scope of this document to go into the
2807*c42dbd0eSchristos details of installing and using TeX, but it is well documented
2808*c42dbd0eSchristos elsewhere.
2809*c42dbd0eSchristos
2810*c42dbd0eSchristos * Instead of generating a single file in the 'html' format, it is
2811*c42dbd0eSchristos also possible to create a directory with individual files for the
2812*c42dbd0eSchristos various chapters. To do so, remove the use of '--no-split' in
2813*c42dbd0eSchristos variable 'MAKEINFOHTML' in the make file in the 'doc' directory.
2814*c42dbd0eSchristos 2815*c42dbd0eSchristos * The make file also supports commands to only generate the file in 2816*c42dbd0eSchristos the desired format and not move them to the documentation 2817*c42dbd0eSchristos directory. This is accomplished through the 'make <format>' 2818*c42dbd0eSchristos command. 2819*c42dbd0eSchristos 2820*c42dbd0eSchristos 2821*c42dbd0eSchristosFile: gprofng.info, Node: Index, Prev: Other Document Formats, Up: Top 2822*c42dbd0eSchristos 2823*c42dbd0eSchristosIndex 2824*c42dbd0eSchristos***** 2825*c42dbd0eSchristos 2826*c42dbd0eSchristos[index] 2827*c42dbd0eSchristos* Menu: 2828*c42dbd0eSchristos 2829*c42dbd0eSchristos* Command line mode: A First Profile. (line 39) 2830*c42dbd0eSchristos* Commands, calltree: The Call Tree. (line 10) 2831*c42dbd0eSchristos* Commands, compare delta: Comparison of Experiments. 2832*c42dbd0eSchristos (line 52) 2833*c42dbd0eSchristos* Commands, compare on/off: Comparison of Experiments. 2834*c42dbd0eSchristos (line 7) 2835*c42dbd0eSchristos* Commands, compare on/off <1>: Comparison of Experiments. 2836*c42dbd0eSchristos (line 51) 2837*c42dbd0eSchristos* Commands, compare ratio: Comparison of Experiments. 2838*c42dbd0eSchristos (line 52) 2839*c42dbd0eSchristos* Commands, compare ratio <1>: Examples Using Hardware Event Counters. 2840*c42dbd0eSchristos (line 166) 2841*c42dbd0eSchristos* Commands, cpus: Commands Specific to Multithreading. 2842*c42dbd0eSchristos (line 177) 2843*c42dbd0eSchristos* Commands, cpu_list: Commands Specific to Multithreading. 2844*c42dbd0eSchristos (line 176) 2845*c42dbd0eSchristos* Commands, disasm: The Disassembly View. 2846*c42dbd0eSchristos (line 8) 2847*c42dbd0eSchristos* Commands, experiment_list: Aggregation of Experiments. 2848*c42dbd0eSchristos (line 21) 2849*c42dbd0eSchristos* Commands, fsingle: Information on Load Objects. 2850*c42dbd0eSchristos (line 10) 2851*c42dbd0eSchristos* Commands, fsingle <1>: Information on Load Objects. 2852*c42dbd0eSchristos (line 36) 2853*c42dbd0eSchristos* Commands, fsummary: Information on Load Objects. 2854*c42dbd0eSchristos (line 10) 2855*c42dbd0eSchristos* Commands, fsummary <1>: Information on Load Objects. 2856*c42dbd0eSchristos (line 62) 2857*c42dbd0eSchristos* Commands, functions: A First Profile. (line 44) 2858*c42dbd0eSchristos* Commands, header: More Information on the Experiment. 2859*c42dbd0eSchristos (line 10) 2860*c42dbd0eSchristos* Commands, header <1>: Control the Sampling Frequency. 2861*c42dbd0eSchristos (line 59) 2862*c42dbd0eSchristos* Commands, limit: Control the Number of Lines in the Output. 2863*c42dbd0eSchristos (line 6) 2864*c42dbd0eSchristos* Commands, lines: The Source Code View. 2865*c42dbd0eSchristos (line 68) 2866*c42dbd0eSchristos* Commands, metrics: Display and Define the Metrics. 2867*c42dbd0eSchristos (line 11) 2868*c42dbd0eSchristos* Commands, metrics <1>: Display and Define the Metrics. 2869*c42dbd0eSchristos (line 38) 2870*c42dbd0eSchristos* Commands, metrics <2>: Metric Definitions. (line 9) 2871*c42dbd0eSchristos* Commands, metric_list: Display and Define the Metrics. 2872*c42dbd0eSchristos (line 10) 2873*c42dbd0eSchristos* Commands, metric_list <1>: Display and Define the Metrics. 2874*c42dbd0eSchristos (line 19) 2875*c42dbd0eSchristos* Commands, metric_list <2>: Examples Using Hardware Event Counters. 2876*c42dbd0eSchristos (line 261) 2877*c42dbd0eSchristos* Commands, objects: Information on Load Objects. 2878*c42dbd0eSchristos (line 10) 2879*c42dbd0eSchristos* Commands, overview: More Information on the Experiment. 
2880*c42dbd0eSchristos (line 54) 2881*c42dbd0eSchristos* Commands, pcs: The Disassembly View. 2882*c42dbd0eSchristos (line 73) 2883*c42dbd0eSchristos* Commands, script: Scripting. (line 11) 2884*c42dbd0eSchristos* Commands, sort: Sorting the Performance Data. 2885*c42dbd0eSchristos (line 6) 2886*c42dbd0eSchristos* Commands, source: The Source Code View. 2887*c42dbd0eSchristos (line 12) 2888*c42dbd0eSchristos* Commands, threads: Commands Specific to Multithreading. 2889*c42dbd0eSchristos (line 35) 2890*c42dbd0eSchristos* Commands, thread_list: Commands Specific to Multithreading. 2891*c42dbd0eSchristos (line 10) 2892*c42dbd0eSchristos* Commands, thread_select: Commands Specific to Multithreading. 2893*c42dbd0eSchristos (line 64) 2894*c42dbd0eSchristos* Commands, thread_select <1>: Commands Specific to Multithreading. 2895*c42dbd0eSchristos (line 88) 2896*c42dbd0eSchristos* Commands, viewmode: Java Profiling. (line 37) 2897*c42dbd0eSchristos* Commands, viewmode <1>: The Viewmode. (line 6) 2898*c42dbd0eSchristos* Compare experiments: Comparison of Experiments. 2899*c42dbd0eSchristos (line 10) 2900*c42dbd0eSchristos* CPI: Hardware Event Counters Explained. 2901*c42dbd0eSchristos (line 31) 2902*c42dbd0eSchristos* CPU: The Concept of a CPU in gprofng. 2903*c42dbd0eSchristos (line 6) 2904*c42dbd0eSchristos* Default metrics: Display and Define the Metrics. 2905*c42dbd0eSchristos (line 34) 2906*c42dbd0eSchristos* ELF: Load Objects and Functions. 2907*c42dbd0eSchristos (line 16) 2908*c42dbd0eSchristos* Exclusive metric: Inclusive and Exclusive Metrics. 2909*c42dbd0eSchristos (line 14) 2910*c42dbd0eSchristos* Experiment directory: Steps Needed to Create a Profile. 2911*c42dbd0eSchristos (line 21) 2912*c42dbd0eSchristos* Filters, Thread selection: Commands Specific to Multithreading. 2913*c42dbd0eSchristos (line 64) 2914*c42dbd0eSchristos* Flavor field: Metric Definitions. (line 13) 2915*c42dbd0eSchristos* Function: Load Objects and Functions. 2916*c42dbd0eSchristos (line 26) 2917*c42dbd0eSchristos* gprofng display html: Steps Needed to Create a Profile. 2918*c42dbd0eSchristos (line 46) 2919*c42dbd0eSchristos* gprofng display text: Steps Needed to Create a Profile. 2920*c42dbd0eSchristos (line 42) 2921*c42dbd0eSchristos* Hardware event counters, alias name: Hardware Event Counters Explained. 2922*c42dbd0eSchristos (line 57) 2923*c42dbd0eSchristos* Hardware event counters, auto option: Examples Using Hardware Event Counters. 2924*c42dbd0eSchristos (line 21) 2925*c42dbd0eSchristos* Hardware event counters, counter definition: Getting Information on the Counters Supported. 2926*c42dbd0eSchristos (line 85) 2927*c42dbd0eSchristos* Hardware event counters, description: Hardware Event Counters Explained. 2928*c42dbd0eSchristos (line 6) 2929*c42dbd0eSchristos* Hardware event counters, hwc metric: Examples Using Hardware Event Counters. 2930*c42dbd0eSchristos (line 31) 2931*c42dbd0eSchristos* Hardware event counters, IPC: Examples Using Hardware Event Counters. 2932*c42dbd0eSchristos (line 250) 2933*c42dbd0eSchristos* Hardware event counters, variable CPU frequency: Hardware Event Counters Explained. 2934*c42dbd0eSchristos (line 40) 2935*c42dbd0eSchristos* Inclusive metric: Inclusive and Exclusive Metrics. 2936*c42dbd0eSchristos (line 9) 2937*c42dbd0eSchristos* Instruction level timings: The Disassembly View. 2938*c42dbd0eSchristos (line 9) 2939*c42dbd0eSchristos* Instruction pointer: The Program Counter. (line 10) 2940*c42dbd0eSchristos* Interpreter mode: A First Profile. 
(line 30) 2941*c42dbd0eSchristos* IPC: Hardware Event Counters Explained. 2942*c42dbd0eSchristos (line 20) 2943*c42dbd0eSchristos* Java profiling, -J <string>: Java Profiling. (line 29) 2944*c42dbd0eSchristos* Java profiling, -j on/off: Java Profiling. (line 6) 2945*c42dbd0eSchristos* Java profiling, <JVM-System>: The Viewmode. (line 15) 2946*c42dbd0eSchristos* Java profiling, <no Java callstack recorded>: The Viewmode. (line 18) 2947*c42dbd0eSchristos* Java profiling, different view modes: Java Profiling. (line 37) 2948*c42dbd0eSchristos* Java profiling, JAVA_PATH: Java Profiling. (line 24) 2949*c42dbd0eSchristos* Java profiling, JDK_HOME: Java Profiling. (line 23) 2950*c42dbd0eSchristos* Leaf function: Inclusive and Exclusive Metrics. 2951*c42dbd0eSchristos (line 19) 2952*c42dbd0eSchristos* List specification: The Selection List. (line 6) 2953*c42dbd0eSchristos* Load object: Load Objects and Functions. 2954*c42dbd0eSchristos (line 11) 2955*c42dbd0eSchristos* Load objects: Information on Load Objects. 2956*c42dbd0eSchristos (line 11) 2957*c42dbd0eSchristos* Metric name field: Metric Definitions. (line 13) 2958*c42dbd0eSchristos* Miscellaneous , ##: The Source Code View. 2959*c42dbd0eSchristos (line 63) 2960*c42dbd0eSchristos* Miscellaneous, <apath>: Information on Load Objects. 2961*c42dbd0eSchristos (line 16) 2962*c42dbd0eSchristos* Miscellaneous, <Total>: A First Profile. (line 86) 2963*c42dbd0eSchristos* mxv-pthreads.exe: The Example Program. (line 12) 2964*c42dbd0eSchristos* Options, -C: More Information on the Experiment. 2965*c42dbd0eSchristos (line 48) 2966*c42dbd0eSchristos* Options, -h: Getting Information on the Counters Supported. 2967*c42dbd0eSchristos (line 9) 2968*c42dbd0eSchristos* Options, -h <1>: Examples Using Hardware Event Counters. 2969*c42dbd0eSchristos (line 21) 2970*c42dbd0eSchristos* Options, -o: Name the Experiment Directory. 2971*c42dbd0eSchristos (line 15) 2972*c42dbd0eSchristos* Options, -O: Name the Experiment Directory. 2973*c42dbd0eSchristos (line 19) 2974*c42dbd0eSchristos* Options, -O <1>: A More Elaborate Example. 2975*c42dbd0eSchristos (line 12) 2976*c42dbd0eSchristos* Options, -p: The Call Tree. (line 66) 2977*c42dbd0eSchristos* Options, -p <1>: Control the Sampling Frequency. 2978*c42dbd0eSchristos (line 18) 2979*c42dbd0eSchristos* PC: The Program Counter. (line 6) 2980*c42dbd0eSchristos* PC <1>: Load Objects and Functions. 2981*c42dbd0eSchristos (line 32) 2982*c42dbd0eSchristos* PC sampling: Sampling versus Tracing. 2983*c42dbd0eSchristos (line 7) 2984*c42dbd0eSchristos* Posix Threads: The Example Program. (line 8) 2985*c42dbd0eSchristos* Program Counter: The Program Counter. (line 6) 2986*c42dbd0eSchristos* Program Counter <1>: Load Objects and Functions. 2987*c42dbd0eSchristos (line 32) 2988*c42dbd0eSchristos* Program Counter sampling: Sampling versus Tracing. 2989*c42dbd0eSchristos (line 7) 2990*c42dbd0eSchristos* Pthreads: The Example Program. (line 8) 2991*c42dbd0eSchristos* Sampling interval: Control the Sampling Frequency. 2992*c42dbd0eSchristos (line 20) 2993*c42dbd0eSchristos* Selection list: The Selection List. (line 6) 2994*c42dbd0eSchristos* Source level timings: The Source Code View. 2995*c42dbd0eSchristos (line 10) 2996*c42dbd0eSchristos* TeX: Other Document Formats. 2997*c42dbd0eSchristos (line 40) 2998*c42dbd0eSchristos* Thread affinity: Commands Specific to Multithreading. 2999*c42dbd0eSchristos (line 162) 3000*c42dbd0eSchristos* Total CPU time: A First Profile. (line 74) 3001*c42dbd0eSchristos* Viewmode: The Viewmode. 
(line 6) 3002*c42dbd0eSchristos* Visibility field: Sorting the Performance Data. 3003*c42dbd0eSchristos (line 9) 3004*c42dbd0eSchristos* Visibility field <1>: Metric Definitions. (line 13) 3005*c42dbd0eSchristos 3006*c42dbd0eSchristos 3007*c42dbd0eSchristos 3008*c42dbd0eSchristosTag Table: 3009*c42dbd0eSchristosNode: Top750 3010*c42dbd0eSchristosNode: Introduction3075 3011*c42dbd0eSchristosNode: Overview4352 3012*c42dbd0eSchristosNode: Main Features4993 3013*c42dbd0eSchristosNode: Sampling versus Tracing6733 3014*c42dbd0eSchristosNode: Steps Needed to Create a Profile9124 3015*c42dbd0eSchristosNode: A Mini Tutorial11311 3016*c42dbd0eSchristosNode: Getting Started12030 3017*c42dbd0eSchristosNode: The Example Program13680 3018*c42dbd0eSchristosNode: A First Profile14800 3019*c42dbd0eSchristosNode: The Source Code View19305 3020*c42dbd0eSchristosNode: The Disassembly View23801 3021*c42dbd0eSchristosNode: Display and Define the Metrics28490 3022*c42dbd0eSchristosNode: A First Customization of the Output30326 3023*c42dbd0eSchristosNode: Name the Experiment Directory32175 3024*c42dbd0eSchristosNode: Control the Number of Lines in the Output33272 3025*c42dbd0eSchristosNode: Sorting the Performance Data34031 3026*c42dbd0eSchristosNode: Scripting34973 3027*c42dbd0eSchristosNode: A More Elaborate Example35892 3028*c42dbd0eSchristosNode: The Call Tree38682 3029*c42dbd0eSchristosNode: More Information on the Experiment41423 3030*c42dbd0eSchristosNode: Control the Sampling Frequency44722 3031*c42dbd0eSchristosNode: Information on Load Objects47143 3032*c42dbd0eSchristosNode: Support for Multithreading50782 3033*c42dbd0eSchristosNode: Creating a Multithreading Experiment51422 3034*c42dbd0eSchristosNode: Commands Specific to Multithreading52935 3035*c42dbd0eSchristosNode: Viewing Multiple Experiments62341 3036*c42dbd0eSchristosNode: Aggregation of Experiments62918 3037*c42dbd0eSchristosNode: Comparison of Experiments65232 3038*c42dbd0eSchristosNode: Profile Hardware Event Counters69848 3039*c42dbd0eSchristosNode: Getting Information on the Counters Supported70556 3040*c42dbd0eSchristosNode: Examples Using Hardware Event Counters78156 3041*c42dbd0eSchristosNode: Java Profiling95853 3042*c42dbd0eSchristosNode: Terminology98222 3043*c42dbd0eSchristosNode: The Program Counter99234 3044*c42dbd0eSchristosNode: Inclusive and Exclusive Metrics99726 3045*c42dbd0eSchristosNode: Metric Definitions101179 3046*c42dbd0eSchristosNode: The Viewmode102364 3047*c42dbd0eSchristosNode: The Selection List103503 3048*c42dbd0eSchristosNode: Load Objects and Functions104905 3049*c42dbd0eSchristosNode: The Concept of a CPU in gprofng106919 3050*c42dbd0eSchristosNode: Hardware Event Counters Explained107680 3051*c42dbd0eSchristosNode: apath111472 3052*c42dbd0eSchristosNode: Other Document Formats112007 3053*c42dbd0eSchristosNode: Index114531 3054*c42dbd0eSchristos 3055*c42dbd0eSchristosEnd Tag Table 3056*c42dbd0eSchristos 3057*c42dbd0eSchristos 3058*c42dbd0eSchristosLocal Variables: 3059*c42dbd0eSchristoscoding: utf-8 3060*c42dbd0eSchristosEnd: 3061