xref: /netbsd-src/external/gpl3/binutils.old/dist/gprofng/doc/gprofng.info (revision c42dbd0ed2e61fe6eda8590caa852ccf34719964)
1*c42dbd0eSchristosThis is gprofng.info, produced by makeinfo version 6.8 from
2*c42dbd0eSchristosgprofng.texi.
3*c42dbd0eSchristos
4*c42dbd0eSchristosThis document is the manual for gprofng, last updated 22 February 2022.
5*c42dbd0eSchristos
6*c42dbd0eSchristos   Copyright (C) 2022 Free Software Foundation, Inc.
7*c42dbd0eSchristos
8*c42dbd0eSchristos   Permission is granted to copy, distribute and/or modify this document
9*c42dbd0eSchristosunder the terms of the GNU Free Documentation License, Version 1.3 or
10*c42dbd0eSchristosany later version published by the Free Software Foundation; with no
11*c42dbd0eSchristosInvariant Sections, with no Front-Cover texts, and with no Back-Cover
12*c42dbd0eSchristosTexts.  A copy of the license is included in the section entitled "GNU
13*c42dbd0eSchristosFree Documentation License."
14*c42dbd0eSchristos
15*c42dbd0eSchristosINFO-DIR-SECTION Software development
16*c42dbd0eSchristosSTART-INFO-DIR-ENTRY
17*c42dbd0eSchristos* gprofng: (gprofng).                    The next generation profiling tool for Linux
18*c42dbd0eSchristosEND-INFO-DIR-ENTRY
19*c42dbd0eSchristos
20*c42dbd0eSchristos
21*c42dbd0eSchristosFile: gprofng.info,  Node: Top,  Next: Introduction,  Up: (dir)
22*c42dbd0eSchristos
23*c42dbd0eSchristosGNU Gprofng
24*c42dbd0eSchristos***********
25*c42dbd0eSchristos
26*c42dbd0eSchristosThis document is the manual for gprofng, last updated 22 February 2022.
27*c42dbd0eSchristos
28*c42dbd0eSchristos   Copyright (C) 2022 Free Software Foundation, Inc.
29*c42dbd0eSchristos
30*c42dbd0eSchristos   Permission is granted to copy, distribute and/or modify this document
31*c42dbd0eSchristosunder the terms of the GNU Free Documentation License, Version 1.3 or
32*c42dbd0eSchristosany later version published by the Free Software Foundation; with no
33*c42dbd0eSchristosInvariant Sections, with no Front-Cover texts, and with no Back-Cover
34*c42dbd0eSchristosTexts.  A copy of the license is included in the section entitled "GNU
35*c42dbd0eSchristosFree Documentation License."
36*c42dbd0eSchristos
37*c42dbd0eSchristos* Menu:
38*c42dbd0eSchristos
39*c42dbd0eSchristos* Introduction::           About this manual.
40*c42dbd0eSchristos* Overview::               A brief overview of gprofng.
41*c42dbd0eSchristos* A Mini Tutorial::        A short tutorial covering the key features.
42*c42dbd0eSchristos* Terminology::            Various concepts and some terminology explained.
43*c42dbd0eSchristos* Other Document Formats:: How to create this document in other formats.
44*c42dbd0eSchristos* Index::                  The index.
45*c42dbd0eSchristos
46*c42dbd0eSchristos
47*c42dbd0eSchristos-- The Detailed Node Listing --
48*c42dbd0eSchristos
49*c42dbd0eSchristosIntroduction
50*c42dbd0eSchristos
51*c42dbd0eSchristosOverview
52*c42dbd0eSchristos
53*c42dbd0eSchristos* Main Features::                     A high level overview.
54*c42dbd0eSchristos* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
55*c42dbd0eSchristos* Steps Needed to Create a Profile::  How to create a profile.
56*c42dbd0eSchristos
57*c42dbd0eSchristosA Mini Tutorial
58*c42dbd0eSchristos
59*c42dbd0eSchristos* Getting Started::                 The basics of profiling with gprofng().
60*c42dbd0eSchristos* Support for Multithreading::      Commands specific to multithreaded applications.
61*c42dbd0eSchristos* Viewing Multiple Experiments::    Analyze multiple experiments.
62*c42dbd0eSchristos* Profile Hardware Event Counters:: How to use hardware event counters.
63*c42dbd0eSchristos* Java Profiling::                  How to profile a Java application.
64*c42dbd0eSchristos
65*c42dbd0eSchristosTerminology
66*c42dbd0eSchristos
67*c42dbd0eSchristos* The Program Counter::                    What is a Program Counter?
68*c42dbd0eSchristos* Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
69*c42dbd0eSchristos* Metric Definitions::                     Definitions associated with metrics.
70*c42dbd0eSchristos* The Viewmode::                           Select the way call stacks are presented.
71*c42dbd0eSchristos* The Selection List::                     How to define a selection.
72*c42dbd0eSchristos* Load Objects and Functions::             The components in an application.
73*c42dbd0eSchristos* The Concept of a CPU in gprofng:: The definition of a CPU.
74*c42dbd0eSchristos* Hardware Event Counters Explained::      What are event counters?
75*c42dbd0eSchristos* apath::                                  Our generic definition of a path.
76*c42dbd0eSchristos
77*c42dbd0eSchristos
78*c42dbd0eSchristos
79*c42dbd0eSchristos
80*c42dbd0eSchristosFile: gprofng.info,  Node: Introduction,  Next: Overview,  Prev: Top,  Up: Top
81*c42dbd0eSchristos
82*c42dbd0eSchristos1 Introduction
83*c42dbd0eSchristos**************
84*c42dbd0eSchristos
85*c42dbd0eSchristosThe gprofng tool is the next generation profiler for Linux.  It consists
86*c42dbd0eSchristosof various commands to generate and display profile information.
87*c42dbd0eSchristos
88*c42dbd0eSchristos   This manual starts with a tutorial how to create and interpret a
89*c42dbd0eSchristosprofile.  This part is highly practical and has the goal to get users up
90*c42dbd0eSchristosto speed as quickly as possible.  As soon as possible, we would like to
91*c42dbd0eSchristosshow you how to get your first profile on your screen.
92*c42dbd0eSchristos
93*c42dbd0eSchristos   This is followed by more examples, covering many of the features.  At
94*c42dbd0eSchristosthe end of this tutorial, you should feel confident enough to tackle the
95*c42dbd0eSchristosmore complex tasks.
96*c42dbd0eSchristos
97*c42dbd0eSchristos   In a future update a more formal reference manual will be included as
98*c42dbd0eSchristoswell.  Since even in this tutorial we use certain terminology, we have
99*c42dbd0eSchristosincluded a chapter with descriptions at the end.  In case you encounter
100*c42dbd0eSchristosunfamiliar wordings or terminology, please check this chapter.
101*c42dbd0eSchristos
102*c42dbd0eSchristos   One word of caution.  In several cases we had to somewhat tweak the
103*c42dbd0eSchristosscreen output in order to make it fit.  This is why the output may look
104*c42dbd0eSchristossomewhat different when you try things yourself.
105*c42dbd0eSchristos
106*c42dbd0eSchristos   For now, we wish you a smooth profiling experience with gprofng and
107*c42dbd0eSchristosgood luck tackling performance bottlenecks.
108*c42dbd0eSchristos
109*c42dbd0eSchristos
110*c42dbd0eSchristosFile: gprofng.info,  Node: Overview,  Next: A Mini Tutorial,  Prev: Introduction,  Up: Top
111*c42dbd0eSchristos
112*c42dbd0eSchristos2 A Brief Overview of gprofng
113*c42dbd0eSchristos*****************************
114*c42dbd0eSchristos
115*c42dbd0eSchristos* Menu:
116*c42dbd0eSchristos
117*c42dbd0eSchristos* Main Features::                     A high level overview.
118*c42dbd0eSchristos* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
119*c42dbd0eSchristos* Steps Needed to Create a Profile::  How to create a profile.
120*c42dbd0eSchristos
121*c42dbd0eSchristosBefore we cover this tool in quite some detail, we start with a brief
122*c42dbd0eSchristosoverview of what it is, and the main features.  Since we know that many
123*c42dbd0eSchristosof you would like to get started rightaway, already in this first
124*c42dbd0eSchristoschapter we explain the basics of profiling with 'gprofng'.
125*c42dbd0eSchristos
126*c42dbd0eSchristos
127*c42dbd0eSchristosFile: gprofng.info,  Node: Main Features,  Next: Sampling versus Tracing,  Up: Overview
128*c42dbd0eSchristos
129*c42dbd0eSchristos2.1 Main Features
130*c42dbd0eSchristos=================
131*c42dbd0eSchristos
132*c42dbd0eSchristosThese are the main features of the gprofng tool:
133*c42dbd0eSchristos
134*c42dbd0eSchristos   * Profiling is supported for an application written in C, C++, Java,
135*c42dbd0eSchristos     or Scala.
136*c42dbd0eSchristos
137*c42dbd0eSchristos   * Shared libraries are supported.  The information is presented at
138*c42dbd0eSchristos     the instruction level.
139*c42dbd0eSchristos
140*c42dbd0eSchristos   * The following multithreading programming models are supported:
141*c42dbd0eSchristos     Pthreads, OpenMP, and Java threads.
142*c42dbd0eSchristos
143*c42dbd0eSchristos   * This tool works with unmodified production level executables.
144*c42dbd0eSchristos     There is no need to recompile the code, but if the '-g' option has
145*c42dbd0eSchristos     been used when building the application, source line level
146*c42dbd0eSchristos     information is available.
147*c42dbd0eSchristos
148*c42dbd0eSchristos   * The focus is on support for code generated with the 'gcc' compiler,
149*c42dbd0eSchristos     but there is some limited support for the 'icc' compiler as well.
150*c42dbd0eSchristos     Future improvements and enhancements will focus on 'gcc' though.
151*c42dbd0eSchristos
152*c42dbd0eSchristos   * Processors from Intel, AMD, and Arm are supported, but the level of
153*c42dbd0eSchristos     support depends on the architectural details.  In particular,
154*c42dbd0eSchristos     hardware event counters may not be supported.
155*c42dbd0eSchristos
156*c42dbd0eSchristos   * Several views into the data are supported.  For example, a function
157*c42dbd0eSchristos     overview where the time is spent, but also a source line,
158*c42dbd0eSchristos     disassembly, call tree and a caller-callees overview are available.
159*c42dbd0eSchristos
160*c42dbd0eSchristos   * Through filters, the user can zoom in on an area of interest.
161*c42dbd0eSchristos
162*c42dbd0eSchristos   * Two or more profiles can be aggregated, or used in a comparison.
163*c42dbd0eSchristos     This comparison can be obtained at the function, source line, and
164*c42dbd0eSchristos     disassembly level.
165*c42dbd0eSchristos
166*c42dbd0eSchristos   * Through a scripting language, and customization of the metrics
167*c42dbd0eSchristos     shown, the generation and creation of a profile can be fully
168*c42dbd0eSchristos     automated and provide tailored output.
169*c42dbd0eSchristos
170*c42dbd0eSchristos
171*c42dbd0eSchristosFile: gprofng.info,  Node: Sampling versus Tracing,  Next: Steps Needed to Create a Profile,  Prev: Main Features,  Up: Overview
172*c42dbd0eSchristos
173*c42dbd0eSchristos2.2 Sampling versus Tracing
174*c42dbd0eSchristos===========================
175*c42dbd0eSchristos
176*c42dbd0eSchristosA key difference with some other profiling tools is that the main data
177*c42dbd0eSchristoscollection command 'gprofng collect app' mostly uses Program Counter
178*c42dbd0eSchristos(PC) sampling under the hood.
179*c42dbd0eSchristos
180*c42dbd0eSchristos   With _sampling_, the executable is stopped at regular intervals.
181*c42dbd0eSchristosEach time it is halted, key information is gathered and stored.  This
182*c42dbd0eSchristosincludes the Program Counter that keeps track of where the execution is.
183*c42dbd0eSchristosHence the name.
184*c42dbd0eSchristos
185*c42dbd0eSchristos   Together with operational data, this information is stored in the
186*c42dbd0eSchristosexperiment directory and can be viewed in the second phase.
187*c42dbd0eSchristos
188*c42dbd0eSchristos   For example, the PC information is used to derive where the program
189*c42dbd0eSchristoswas when it was halted.  Since the sampling interval is known, it is
190*c42dbd0eSchristosrelatively easy to derive how much time was spent in the various parts
191*c42dbd0eSchristosof the program.
192*c42dbd0eSchristos
193*c42dbd0eSchristos   The opposite technique is generally referred to as _tracing_.  With
194*c42dbd0eSchristostracing, the target is instrumented with specific calls that collect the
195*c42dbd0eSchristosrequested information.
196*c42dbd0eSchristos
197*c42dbd0eSchristos   These are some of the pros and cons of PC sampling verus tracing:
198*c42dbd0eSchristos
199*c42dbd0eSchristos   * Since there is no need to recompile, existing executables can be
200*c42dbd0eSchristos     used and the profile measures the behaviour of exactly the same
201*c42dbd0eSchristos     executable that is used in production runs.
202*c42dbd0eSchristos
203*c42dbd0eSchristos     With sampling, one inherently profiles a different executable
204*c42dbd0eSchristos     because the calls to the instrumentation library may affect the
205*c42dbd0eSchristos     compiler optimizations and run time behaviour.
206*c42dbd0eSchristos
207*c42dbd0eSchristos   * With sampling, there are very few restrictions on what can be
208*c42dbd0eSchristos     profiled and even without access to the source code, a basic
209*c42dbd0eSchristos     profile can be made.
210*c42dbd0eSchristos
211*c42dbd0eSchristos   * A downside of sampling is that, depending on the sampling
212*c42dbd0eSchristos     frequency, small functions may be missed or not captured
213*c42dbd0eSchristos     accurately.  Although this is rare, this may happen and is the
214*c42dbd0eSchristos     reason why the user has control over the sampling rate.
215*c42dbd0eSchristos
216*c42dbd0eSchristos   * While tracing produces precise information, sampling is statistical
217*c42dbd0eSchristos     in nature.  As a result, small variations may occur across
218*c42dbd0eSchristos     seemingly identical runs.  We have not observed more than a few
219*c42dbd0eSchristos     percent deviation though.  Especially if the target job executed
220*c42dbd0eSchristos     for a sufficiently long time.
221*c42dbd0eSchristos
222*c42dbd0eSchristos   * With sampling, it is not possible to get an accurate count how
223*c42dbd0eSchristos     often functions are called.
224*c42dbd0eSchristos
225*c42dbd0eSchristos
226*c42dbd0eSchristosFile: gprofng.info,  Node: Steps Needed to Create a Profile,  Prev: Sampling versus Tracing,  Up: Overview
227*c42dbd0eSchristos
228*c42dbd0eSchristos2.3 Steps Needed to Create a Profile
229*c42dbd0eSchristos====================================
230*c42dbd0eSchristos
231*c42dbd0eSchristosCreating a profile takes two steps.  First the profile data needs to be
232*c42dbd0eSchristosgenerated.  This is followed by a viewing step to create a report from
233*c42dbd0eSchristosthe information that has been gathered.
234*c42dbd0eSchristos
235*c42dbd0eSchristos   Every gprofng command starts with 'gprofng', the name of the driver.
236*c42dbd0eSchristosThis is followed by a keyword to define the high level functionality.
237*c42dbd0eSchristosDepending on this keyword, a third qualifier may be needed to further
238*c42dbd0eSchristosnarrow down the request.  This combination is then followed by options
239*c42dbd0eSchristosthat are specific to the functionality desired.
240*c42dbd0eSchristos
241*c42dbd0eSchristos   The command to gather, or "collect", the performance data is called
242*c42dbd0eSchristos'gprofng collect app'.  Aside from numerous options, this command takes
243*c42dbd0eSchristosthe name of the target executable as an input parameter.
244*c42dbd0eSchristos
245*c42dbd0eSchristos   Upon completion of the run, the performance data can be found in the
246*c42dbd0eSchristosnewly created experiment directory.
247*c42dbd0eSchristos
248*c42dbd0eSchristos   Unless explicitly specified otherwise, a default name for this
249*c42dbd0eSchristosdirectory is chosen.  The name is 'test.<n>.er' where 'n' is the first
250*c42dbd0eSchristosinteger number not in use yet for such a name.
251*c42dbd0eSchristos
252*c42dbd0eSchristos   For example, the first time 'gprofng collect app' is invoked, an
253*c42dbd0eSchristosexperiment directory with the name 'test.1.er' is created.
254*c42dbd0eSchristos
255*c42dbd0eSchristos   Upon a subsequent invocation of 'gprofng collect app' in the same
256*c42dbd0eSchristosdirectory, an experiment directory with the name 'test.2.er' will be
257*c42dbd0eSchristoscreated, and so forth.
258*c42dbd0eSchristos
259*c42dbd0eSchristos   Note that 'gprofng collect app' supports an option to explicitly name
260*c42dbd0eSchristosthe experiment directory.  Outside of the restriction that the name of
261*c42dbd0eSchristosthis directory has to end with '.er', any valid directory name can be
262*c42dbd0eSchristosused for this.
263*c42dbd0eSchristos
264*c42dbd0eSchristos   Now that we have the performance data, the next step is to display
265*c42dbd0eSchristosit.
266*c42dbd0eSchristos
267*c42dbd0eSchristos   The most commonly used command to view the performance information is
268*c42dbd0eSchristos'gprofng display text'.  This is a very extensive and customizable tool
269*c42dbd0eSchristosthat produces the information in ASCII format.
270*c42dbd0eSchristos
271*c42dbd0eSchristos   Another option is to use 'gprofng display html'.  This tool generates
272*c42dbd0eSchristosa directory with files in html format.  These can be viewed in a
273*c42dbd0eSchristosbrowser, allowing for easy navigation through the profile data.
274*c42dbd0eSchristos
275*c42dbd0eSchristos
276*c42dbd0eSchristosFile: gprofng.info,  Node: A Mini Tutorial,  Next: Terminology,  Prev: Overview,  Up: Top
277*c42dbd0eSchristos
278*c42dbd0eSchristos3 A Mini Tutorial
279*c42dbd0eSchristos*****************
280*c42dbd0eSchristos
281*c42dbd0eSchristosIn this chapter we present and discuss the main functionality of
282*c42dbd0eSchristos'gprofng'.  This will be a practical approach, using an example code to
283*c42dbd0eSchristosgenerate profile data and show how to get various performance reports.
284*c42dbd0eSchristos
285*c42dbd0eSchristos* Menu:
286*c42dbd0eSchristos
287*c42dbd0eSchristos* Getting Started::                 The basics of profiling with gprofng().
288*c42dbd0eSchristos* Support for Multithreading::      Commands specific to multithreaded applications.
289*c42dbd0eSchristos* Viewing Multiple Experiments::    Analyze multiple experiments.
290*c42dbd0eSchristos* Profile Hardware Event Counters:: How to use hardware event counters.
291*c42dbd0eSchristos* Java Profiling::                  How to profile a Java application.
292*c42dbd0eSchristos
293*c42dbd0eSchristos
294*c42dbd0eSchristosFile: gprofng.info,  Node: Getting Started,  Next: Support for Multithreading,  Up: A Mini Tutorial
295*c42dbd0eSchristos
296*c42dbd0eSchristos3.1 Getting Started
297*c42dbd0eSchristos===================
298*c42dbd0eSchristos
299*c42dbd0eSchristosThe information presented here provides a good and common basis for many
300*c42dbd0eSchristosprofiling tasks, but there are more features that you may want to
301*c42dbd0eSchristosleverage.
302*c42dbd0eSchristos
303*c42dbd0eSchristos   These are covered in subsequent sections in this chapter.
304*c42dbd0eSchristos
305*c42dbd0eSchristos* Menu:
306*c42dbd0eSchristos
307*c42dbd0eSchristos* The Example Program::                        A description of the example program used.
308*c42dbd0eSchristos* A First Profile::                            How to get the first profile.
309*c42dbd0eSchristos* The Source Code View::                       Display the metrics in the source code.
310*c42dbd0eSchristos* The Disassembly View::                       Display the metrics at the instruction level.
311*c42dbd0eSchristos* Display and Define the Metrics::             An example how to customize the metrics.
312*c42dbd0eSchristos* A First Customization of the Output::        An example how to customize the output.
313*c42dbd0eSchristos* Name the Experiment Directory::              Change the name of the experiment directory.
314*c42dbd0eSchristos* Control the Number of Lines in the Output::  Change the number of lines in the tables.
315*c42dbd0eSchristos* Sorting the Performance Data::               How to set the metric to sort by.
316*c42dbd0eSchristos* Scripting::                                  Use a script to execute the commands.
317*c42dbd0eSchristos* A More Elaborate Example::                   An example of customization.
318*c42dbd0eSchristos* The Call Tree::                              Display the dynamic call tree.
319*c42dbd0eSchristos* More Information on the Experiment::         How to get additional statistics.
320*c42dbd0eSchristos* Control the Sampling Frequency::             How to control the sampling granularity.
321*c42dbd0eSchristos* Information on Load Objects::                How to get more information on load objects.
322*c42dbd0eSchristos
323*c42dbd0eSchristos
324*c42dbd0eSchristosFile: gprofng.info,  Node: The Example Program,  Next: A First Profile,  Up: Getting Started
325*c42dbd0eSchristos
326*c42dbd0eSchristos3.1.1 The Example Program
327*c42dbd0eSchristos-------------------------
328*c42dbd0eSchristos
329*c42dbd0eSchristosThroughout this guide we use the same example C code that implements the
330*c42dbd0eSchristosmultiplication of a vector of length n by an m by n matrix.  The result
331*c42dbd0eSchristosis stored in a vector of length m.  The algorithm has been parallelized
332*c42dbd0eSchristosusing Posix Threads, or Pthreads for short.
333*c42dbd0eSchristos
334*c42dbd0eSchristos   The code was built using the 'gcc' compiler and the name of the
335*c42dbd0eSchristosexecutable is mxv-pthreads.exe.
336*c42dbd0eSchristos
337*c42dbd0eSchristos   The matrix sizes can be set through the '-m' and '-n' options.  The
338*c42dbd0eSchristosnumber of threads is set with the '-t' option.  To increase the duration
339*c42dbd0eSchristosof the run, the multiplication is executed repeatedly.
340*c42dbd0eSchristos
341*c42dbd0eSchristos   This is an example that multiplies a 3000 by 2000 matrix with a
342*c42dbd0eSchristosvector of length 2000 using 2 threads:
343*c42dbd0eSchristos
344*c42dbd0eSchristos     $ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
345*c42dbd0eSchristos     mxv: error check passed - rows = 3000 columns = 2000 threads = 2
346*c42dbd0eSchristos     $
347*c42dbd0eSchristos
348*c42dbd0eSchristos   The program performs an internal check to verify the results are
349*c42dbd0eSchristoscorrect.  The result of this check is printed, followed by the matrix
350*c42dbd0eSchristossizes and the number of threads used.
351*c42dbd0eSchristos
352*c42dbd0eSchristos
353*c42dbd0eSchristosFile: gprofng.info,  Node: A First Profile,  Next: The Source Code View,  Prev: The Example Program,  Up: Getting Started
354*c42dbd0eSchristos
355*c42dbd0eSchristos3.1.2 A First Profile
356*c42dbd0eSchristos---------------------
357*c42dbd0eSchristos
358*c42dbd0eSchristosThe first step is to collect the performance data.  It is important to
359*c42dbd0eSchristosremember that much more information is gathered than may be shown by
360*c42dbd0eSchristosdefault.  Often a single data collection run is sufficient to get a lot
361*c42dbd0eSchristosof insight.
362*c42dbd0eSchristos
363*c42dbd0eSchristos   The 'gprofng collect app' command is used for the data collection.
364*c42dbd0eSchristosNothing needs to be changed in the way the application is executed.  The
365*c42dbd0eSchristosonly difference is that it is now run under control of the tool, as
366*c42dbd0eSchristosshown below:
367*c42dbd0eSchristos
368*c42dbd0eSchristos     $ gprofng collect app ./mxv.pthreads.exe -m 3000 -n 2000 -t 1
369*c42dbd0eSchristos
370*c42dbd0eSchristos   This command produces the following output:
371*c42dbd0eSchristos
372*c42dbd0eSchristos     Creating experiment database test.1.er (Process ID: 2416504) ...
373*c42dbd0eSchristos     mxv: error check passed - rows = 3000 columns = 2000 threads = 1
374*c42dbd0eSchristos
375*c42dbd0eSchristos   We see the message that a directory with the name 'test.1.er' has
376*c42dbd0eSchristosbeen created.  The application then completes as usual and we have our
377*c42dbd0eSchristosfirst experiment directory that can be analyzed.
378*c42dbd0eSchristos
379*c42dbd0eSchristos   The tool we use for this is called 'gprofng display text'.  It takes
380*c42dbd0eSchristosthe name of the experiment directory as an argument.
381*c42dbd0eSchristos
382*c42dbd0eSchristos   If invoked this way, the tool starts in the interactive _interpreter_
383*c42dbd0eSchristosmode.  While in this environment, commands can be given and the tool
384*c42dbd0eSchristosresponds.  This is illustrated below:
385*c42dbd0eSchristos
386*c42dbd0eSchristos     $ gprofng display text test.1.er
387*c42dbd0eSchristos     Warning: History and command editing is not supported on this system.
388*c42dbd0eSchristos     (gp-display-text) quit
389*c42dbd0eSchristos     $
390*c42dbd0eSchristos
391*c42dbd0eSchristos   While useful in certain cases, we prefer to use this tool in command
392*c42dbd0eSchristosline mode, by specifying the commands to be issued when invoking the
393*c42dbd0eSchristostool.  The way to do this is to prepend the command with a hyphen ('-')
394*c42dbd0eSchristosif used on the command line.
395*c42dbd0eSchristos
396*c42dbd0eSchristos   For example, with the 'functions' command we request a list of the
397*c42dbd0eSchristosfunctions that have been executed and their respective CPU times:
398*c42dbd0eSchristos
399*c42dbd0eSchristos     $ gprofng display text -functions test.1.er
400*c42dbd0eSchristos
401*c42dbd0eSchristos     $ gprofng display text -functions test.1.er
402*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
403*c42dbd0eSchristos
404*c42dbd0eSchristos     Excl.     Incl.      Name
405*c42dbd0eSchristos     Total     Total
406*c42dbd0eSchristos     CPU sec.  CPU sec.
407*c42dbd0eSchristos     2.272     2.272      <Total>
408*c42dbd0eSchristos     2.160     2.160      mxv_core
409*c42dbd0eSchristos     0.047     0.103      init_data
410*c42dbd0eSchristos     0.030     0.043      erand48_r
411*c42dbd0eSchristos     0.013     0.013      __drand48_iterate
412*c42dbd0eSchristos     0.013     0.056      drand48
413*c42dbd0eSchristos     0.008     0.010      _int_malloc
414*c42dbd0eSchristos     0.001     0.001      brk
415*c42dbd0eSchristos     0.001     0.002      sysmalloc
416*c42dbd0eSchristos     0.        0.001      __default_morecore
417*c42dbd0eSchristos     0.        0.113      __libc_start_main
418*c42dbd0eSchristos     0.        0.010      allocate_data
419*c42dbd0eSchristos     0.        2.160      collector_root
420*c42dbd0eSchristos     0.        2.160      driver_mxv
421*c42dbd0eSchristos     0.        0.113      main
422*c42dbd0eSchristos     0.        0.010      malloc
423*c42dbd0eSchristos     0.        0.001      sbrk
424*c42dbd0eSchristos
425*c42dbd0eSchristos   As easy and simple as these steps are, we do have a first profile of
426*c42dbd0eSchristosour program!  There are three columns.  The first two contain the _Total
427*c42dbd0eSchristosCPU Time_, which is the sum of the user and system time.  *Note
428*c42dbd0eSchristosInclusive and Exclusive Metrics:: for an explanation of "exclusive" and
429*c42dbd0eSchristos"inclusive" times.
430*c42dbd0eSchristos
431*c42dbd0eSchristos   The first line echoes the metric that is used to sort the output.  By
432*c42dbd0eSchristosdefault, this is the exclusive CPU time, but the sort metric can be
433*c42dbd0eSchristoschanged by the user.
434*c42dbd0eSchristos
435*c42dbd0eSchristos   We then see three columns with the exclusive and inclusive CPU times,
436*c42dbd0eSchristosplus the name of the function.
437*c42dbd0eSchristos
438*c42dbd0eSchristos   The function with the name '<Total>' is not a user function, but is
439*c42dbd0eSchristosintroduced by 'gprofng' and is used to display the accumulated metric
440*c42dbd0eSchristosvalues.  In this case, we see that the total CPU time of this job was
441*c42dbd0eSchristos'2.272' seconds.
442*c42dbd0eSchristos
443*c42dbd0eSchristos   With '2.160' seconds, function 'mxv_core' is the most time consuming
444*c42dbd0eSchristosfunction.  It is also a leaf function.
445*c42dbd0eSchristos
446*c42dbd0eSchristos   The next function in the list is 'init_data'.  Although the CPU time
447*c42dbd0eSchristosspent in this part is negligible, this is an interesting entry because
448*c42dbd0eSchristosthe inclusive CPU time of '0.103' seconds is higher than the exclusive
449*c42dbd0eSchristosCPU time of '0.047' seconds.  Clearly it is calling another function, or
450*c42dbd0eSchristoseven more than one function.  *Note The Call Tree:: for the details how
451*c42dbd0eSchristosto get more information on this.
452*c42dbd0eSchristos
453*c42dbd0eSchristos   The function 'collector_root' does not look familiar.  It is one of
454*c42dbd0eSchristosthe internal functions used by 'gprofng collect app' and can be ignored.
455*c42dbd0eSchristosWhile the inclusive time is high, the exclusive time is zero.  This
456*c42dbd0eSchristosmeans it doesn't contribute to the performance.
457*c42dbd0eSchristos
458*c42dbd0eSchristos   The question is how we know where this function originates from?
459*c42dbd0eSchristosThere is a very useful command to get more details on a function.  *Note
460*c42dbd0eSchristosInformation on Load Objects::.
461*c42dbd0eSchristos
462*c42dbd0eSchristos
463*c42dbd0eSchristosFile: gprofng.info,  Node: The Source Code View,  Next: The Disassembly View,  Prev: A First Profile,  Up: Getting Started
464*c42dbd0eSchristos
465*c42dbd0eSchristos3.1.3 The Source Code View
466*c42dbd0eSchristos--------------------------
467*c42dbd0eSchristos
468*c42dbd0eSchristosIn general, you would like to focus the tuning efforts on the most time
469*c42dbd0eSchristosconsuming part(s) of the program.  In this case that is easy, since
470*c42dbd0eSchristos2.160 seconds on a total of 2.272 seconds is spent in function
471*c42dbd0eSchristos'mxv_core'.  That is 95% of the total and it is time to dig deeper and
472*c42dbd0eSchristoslook at the time distribution at the source code level.
473*c42dbd0eSchristos
474*c42dbd0eSchristos   The 'source' command is used to accomplish this.  It takes the name
475*c42dbd0eSchristosof the function, not the source filename, as an argument.  This is
476*c42dbd0eSchristosdemonstrated below, where the 'gprofng display text' command is used to
477*c42dbd0eSchristosshow the annotated source listing of function 'mxv_core'.
478*c42dbd0eSchristos
479*c42dbd0eSchristos   Please note that the source code has to be compiled with the '-g'
480*c42dbd0eSchristosoption in order for the source code feature to work.  Otherwise the
481*c42dbd0eSchristoslocation can not be determined.
482*c42dbd0eSchristos
483*c42dbd0eSchristos     $ gprofng display text -source mxv_core test.1.er
484*c42dbd0eSchristos
485*c42dbd0eSchristos   The slightly modified output is as follows:
486*c42dbd0eSchristos
487*c42dbd0eSchristos     Source file: <apath>/mxv.c
488*c42dbd0eSchristos     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
489*c42dbd0eSchristos     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)
490*c42dbd0eSchristos
491*c42dbd0eSchristos        Excl.     Incl.
492*c42dbd0eSchristos        Total     Total
493*c42dbd0eSchristos        CPU sec.  CPU sec.
494*c42dbd0eSchristos
495*c42dbd0eSchristos        <lines deleted>
496*c42dbd0eSchristos                                    <Function: mxv_core>
497*c42dbd0eSchristos        0.        0.             32. void __attribute__ ((noinline))
498*c42dbd0eSchristos                                     mxv_core (
499*c42dbd0eSchristos                                     uint64_t row_index_start,
500*c42dbd0eSchristos                                     uint64_t row_index_end,
501*c42dbd0eSchristos                                     uint64_t m, uint64_t n,
502*c42dbd0eSchristos                                     double **restrict A,
503*c42dbd0eSchristos                                     double *restrict b,
504*c42dbd0eSchristos                                     double *restrict c)
505*c42dbd0eSchristos        0.        0.             33. {
506*c42dbd0eSchristos        0.        0.             34.    for (uint64_t i=row_index_start;
507*c42dbd0eSchristos                                             i<=row_index_end; i++) {
508*c42dbd0eSchristos        0.        0.             35.       double row_sum = 0.0;
509*c42dbd0eSchristos     ## 1.687     1.687          36.       for (int64_t j=0; j<n; j++)
510*c42dbd0eSchristos        0.473     0.473          37.          row_sum += A[i][j]*b[j];
511*c42dbd0eSchristos        0.        0.             38.       c[i] = row_sum;
512*c42dbd0eSchristos                                 39.    }
513*c42dbd0eSchristos        0.        0.             40. }
514*c42dbd0eSchristos
515*c42dbd0eSchristos   The first three lines provide information on the location of the
516*c42dbd0eSchristossource file, the object file and the load object (*Note Load Objects and
517*c42dbd0eSchristosFunctions::).
518*c42dbd0eSchristos
519*c42dbd0eSchristos   Function 'mxv_core' is part of a source file that has other functions
520*c42dbd0eSchristosas well.  These functions will be shown, but without timing information.
521*c42dbd0eSchristosThey have been removed in the output shown above.
522*c42dbd0eSchristos
523*c42dbd0eSchristos   This is followed by the annotated source code listing.  The selected
524*c42dbd0eSchristosmetrics are shown first, followed by a source line number, and the
525*c42dbd0eSchristossource code.  The most time consuming line(s) are marked with the '##'
526*c42dbd0eSchristossymbol.  In this way they are easier to find.
527*c42dbd0eSchristos
528*c42dbd0eSchristos   What we see is that all of the time is spent in lines 36-37.
529*c42dbd0eSchristos
530*c42dbd0eSchristos   A related command sometimes comes handy as well.  It is called
531*c42dbd0eSchristos'lines' and displays a list of the source lines and their metrics,
532*c42dbd0eSchristosordered according to the current sort metric (*Note Sorting the
533*c42dbd0eSchristosPerformance Data::).
534*c42dbd0eSchristos
535*c42dbd0eSchristos   Below the command and the output.  For lay-out reasons, only the top
536*c42dbd0eSchristos10 is shown here and the last part of the text on some lines has been
537*c42dbd0eSchristosreplaced by dots.
538*c42dbd0eSchristos
539*c42dbd0eSchristos     $ gprofng display text -lines test.1.er
540*c42dbd0eSchristos
541*c42dbd0eSchristos     Lines sorted by metric: Exclusive Total CPU Time
542*c42dbd0eSchristos
543*c42dbd0eSchristos     Excl.     Incl.  Name
544*c42dbd0eSchristos     Total     Total
545*c42dbd0eSchristos     CPU sec.  CPU sec.
546*c42dbd0eSchristos     2.272     2.272  <Total>
547*c42dbd0eSchristos     1.687     1.687  mxv_core, line 36 in "mxv.c"
548*c42dbd0eSchristos     0.473     0.473  mxv_core, line 37 in "mxv.c"
549*c42dbd0eSchristos     0.032     0.088  init_data, line 72 in "manage_data.c"
550*c42dbd0eSchristos     0.030     0.043  <Function: erand48_r, instructions without line numbers>
551*c42dbd0eSchristos     0.013     0.013  <Function: __drand48_iterate, instructions without ...>
552*c42dbd0eSchristos     0.013     0.056  <Function: drand48, instructions without line numbers>
553*c42dbd0eSchristos     0.012     0.012  init_data, line 77 in "manage_data.c"
554*c42dbd0eSchristos     0.008     0.010  <Function: _int_malloc, instructions without ...>
555*c42dbd0eSchristos     0.003     0.003  init_data, line 71 in "manage_data.c"
556*c42dbd0eSchristos
557*c42dbd0eSchristos   What this overview immediately highlights is that the next most time
558*c42dbd0eSchristosconsuming source line takes 0.032 seconds only.  With an inclusive time
559*c42dbd0eSchristosof 0.088 seconds, it is also clear that this branch of the code does not
560*c42dbd0eSchristosimpact the performance.
561*c42dbd0eSchristos
562*c42dbd0eSchristos
563*c42dbd0eSchristosFile: gprofng.info,  Node: The Disassembly View,  Next: Display and Define the Metrics,  Prev: The Source Code View,  Up: Getting Started
564*c42dbd0eSchristos
565*c42dbd0eSchristos3.1.4 The Disassembly View
566*c42dbd0eSchristos--------------------------
567*c42dbd0eSchristos
568*c42dbd0eSchristosThe source view is very useful to obtain more insight where the time is
569*c42dbd0eSchristosspent, but sometimes this is not sufficient.  This is when the
570*c42dbd0eSchristosdisassembly view comes in.  It is activated with the 'disasm' command
571*c42dbd0eSchristosand as with the source view, it displays an annotated listing.  In this
572*c42dbd0eSchristoscase it shows the instructions with the metrics, interleaved with the
573*c42dbd0eSchristossource lines.  The instructions have a reference in square brackets ('['
574*c42dbd0eSchristosand ']') to the source line they correspond to.
575*c42dbd0eSchristos
576*c42dbd0eSchristosThis is what we get for our example:
577*c42dbd0eSchristos
578*c42dbd0eSchristos     $ gprofng display text -disasm mxv_core test.1.er
579*c42dbd0eSchristos
580*c42dbd0eSchristos     Source file: <apath>/mxv.c
581*c42dbd0eSchristos     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
582*c42dbd0eSchristos     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)
583*c42dbd0eSchristos
584*c42dbd0eSchristos        Excl.     Incl.
585*c42dbd0eSchristos        Total     Total
586*c42dbd0eSchristos        CPU sec.  CPU sec.
587*c42dbd0eSchristos
588*c42dbd0eSchristos        <lines deleted>
589*c42dbd0eSchristos                             32. void __attribute__ ((noinline))
590*c42dbd0eSchristos                                 mxv_core (
591*c42dbd0eSchristos                                 uint64_t row_index_start,
592*c42dbd0eSchristos                                 uint64_t row_index_end,
593*c42dbd0eSchristos                                 uint64_t m, uint64_t n,
594*c42dbd0eSchristos                                 double **restrict A,
595*c42dbd0eSchristos                                 double *restrict b,
596*c42dbd0eSchristos                                 double *restrict c)
597*c42dbd0eSchristos                             33. {
598*c42dbd0eSchristos                                 <Function: mxv_core>
599*c42dbd0eSchristos        0.        0.             [33]   4021ba:  mov    0x8(%rsp),%r10
600*c42dbd0eSchristos                             34.    for (uint64_t i=row_index_start;
601*c42dbd0eSchristos                                         i<=row_index_end; i++) {
602*c42dbd0eSchristos        0.        0.             [34]   4021bf:  cmp    %rsi,%rdi
603*c42dbd0eSchristos        0.        0.             [34]   4021c2:  jbe    0x37
604*c42dbd0eSchristos        0.        0.             [34]   4021c4:  ret
605*c42dbd0eSchristos                             35.        double row_sum = 0.0;
606*c42dbd0eSchristos                             36.        for (int64_t j=0; j<n; j++)
607*c42dbd0eSchristos                             37.           row_sum += A[i][j]*b[j];
608*c42dbd0eSchristos        0.        0.             [37]   4021c5:  mov    (%r8,%rdi,8),%rdx
609*c42dbd0eSchristos        0.        0.             [36]   4021c9:  mov    $0x0,%eax
610*c42dbd0eSchristos        0.        0.             [35]   4021ce:  pxor   %xmm1,%xmm1
611*c42dbd0eSchristos        0.002     0.002          [37]   4021d2:  movsd  (%rdx,%rax,8),%xmm0
612*c42dbd0eSchristos        0.096     0.096          [37]   4021d7:  mulsd  (%r9,%rax,8),%xmm0
613*c42dbd0eSchristos        0.375     0.375          [37]   4021dd:  addsd  %xmm0,%xmm1
614*c42dbd0eSchristos     ## 1.683     1.683          [36]   4021e1:  add    $0x1,%rax
615*c42dbd0eSchristos        0.004     0.004          [36]   4021e5:  cmp    %rax,%rcx
616*c42dbd0eSchristos        0.        0.             [36]   4021e8:  jne    0xffffffffffffffea
617*c42dbd0eSchristos                             38.        c[i] = row_sum;
618*c42dbd0eSchristos        0.        0.             [38]   4021ea:  movsd  %xmm1,(%r10,%rdi,8)
619*c42dbd0eSchristos        0.        0.             [34]   4021f0:  add    $0x1,%rdi
620*c42dbd0eSchristos        0.        0.             [34]   4021f4:  cmp    %rdi,%rsi
621*c42dbd0eSchristos        0.        0.             [34]   4021f7:  jb     0xd
622*c42dbd0eSchristos        0.        0.             [35]   4021f9:  pxor   %xmm1,%xmm1
623*c42dbd0eSchristos        0.        0.             [36]   4021fd:  test   %rcx,%rcx
624*c42dbd0eSchristos        0.        0.             [36]   402200:  jne    0xffffffffffffffc5
625*c42dbd0eSchristos        0.        0.             [36]   402202:  jmp    0xffffffffffffffe8
626*c42dbd0eSchristos                             39.    }
627*c42dbd0eSchristos                             40. }
628*c42dbd0eSchristos        0.        0.             [40]   402204:  ret
629*c42dbd0eSchristos
630*c42dbd0eSchristos   For each instruction, the timing values are given and we can exactly
631*c42dbd0eSchristoswhich ones are the most expensive.  As with the source level view, the
632*c42dbd0eSchristosmost expensive instructions are market with the '##' symbol.
633*c42dbd0eSchristos
634*c42dbd0eSchristos   As illustrated below and similar to the 'lines' command, we can get
635*c42dbd0eSchristosan overview of the instructions executed by using the 'pcs' command.
636*c42dbd0eSchristos
637*c42dbd0eSchristosBelow the command and the output, which again has been restricted to 10
638*c42dbd0eSchristoslines:
639*c42dbd0eSchristos
640*c42dbd0eSchristos     $ gprofng display text -pcs test.1.er
641*c42dbd0eSchristos
642*c42dbd0eSchristos     PCs sorted by metric: Exclusive Total CPU Time
643*c42dbd0eSchristos
644*c42dbd0eSchristos     Excl.     Incl.      Name
645*c42dbd0eSchristos     Total     Total
646*c42dbd0eSchristos     CPU sec.  CPU sec.
647*c42dbd0eSchristos     2.272     2.272  <Total>
648*c42dbd0eSchristos     1.683     1.683  mxv_core + 0x00000027, line 36 in "mxv.c"
649*c42dbd0eSchristos     0.375     0.375  mxv_core + 0x00000023, line 37 in "mxv.c"
650*c42dbd0eSchristos     0.096     0.096  mxv_core + 0x0000001D, line 37 in "mxv.c"
651*c42dbd0eSchristos     0.027     0.027  init_data + 0x000000BD, line 72 in "manage_data.c"
652*c42dbd0eSchristos     0.012     0.012  init_data + 0x00000117, line 77 in "manage_data.c"
653*c42dbd0eSchristos     0.008     0.008  _int_malloc + 0x00000A45
654*c42dbd0eSchristos     0.007     0.007  erand48_r + 0x00000062
655*c42dbd0eSchristos     0.006     0.006  drand48 + 0x00000000
656*c42dbd0eSchristos     0.005     0.005  __drand48_iterate + 0x00000005
657*c42dbd0eSchristos
658*c42dbd0eSchristos
659*c42dbd0eSchristosFile: gprofng.info,  Node: Display and Define the Metrics,  Next: A First Customization of the Output,  Prev: The Disassembly View,  Up: Getting Started
660*c42dbd0eSchristos
661*c42dbd0eSchristos3.1.5 Display and Define the Metrics
662*c42dbd0eSchristos------------------------------------
663*c42dbd0eSchristos
664*c42dbd0eSchristosThe default metrics shown by 'gprofng display text' are useful, but
665*c42dbd0eSchristosthere is more recorded than displayed.  We can customize the values
666*c42dbd0eSchristosshown by defining the metrics ourselves.
667*c42dbd0eSchristos
668*c42dbd0eSchristos   There are two commands related to changing the metrics shown:
669*c42dbd0eSchristos'metric_list' and 'metrics'.
670*c42dbd0eSchristos
671*c42dbd0eSchristos   The first command shows the metrics in use, plus all the metrics that
672*c42dbd0eSchristoshave been stored as part of the experiment.  The second command may be
673*c42dbd0eSchristosused to define the metric list.
674*c42dbd0eSchristos
675*c42dbd0eSchristos   In our example we get the following values for the metrics:
676*c42dbd0eSchristos
677*c42dbd0eSchristos     $ gprofng display text -metric_list test.1.er
678*c42dbd0eSchristos
679*c42dbd0eSchristos     Current metrics: e.totalcpu:i.totalcpu:name
680*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
681*c42dbd0eSchristos     Available metrics:
682*c42dbd0eSchristos        Exclusive Total CPU Time: e.%totalcpu
683*c42dbd0eSchristos        Inclusive Total CPU Time: i.%totalcpu
684*c42dbd0eSchristos                            Size: size
685*c42dbd0eSchristos                      PC Address: address
686*c42dbd0eSchristos                            Name: name
687*c42dbd0eSchristos
688*c42dbd0eSchristos   This shows the metrics currently in use, the metric that is used to
689*c42dbd0eSchristossort the data and all the metrics that have been recorded, but are not
690*c42dbd0eSchristosnecessarily shown.
691*c42dbd0eSchristos
692*c42dbd0eSchristos   In this case, the default metrics are set to the exclusive and
693*c42dbd0eSchristosinclusive total CPU times, plus the name of the function, or load
694*c42dbd0eSchristosobject.
695*c42dbd0eSchristos
696*c42dbd0eSchristos   The 'metrics' command is used to define the metrics that need to be
697*c42dbd0eSchristosdisplayed.
698*c42dbd0eSchristos
699*c42dbd0eSchristos   For example, to display the exclusive total CPU time, both as a
700*c42dbd0eSchristosnumber and a percentage, use the following metric definition:
701*c42dbd0eSchristos'e.%totalcpu'
702*c42dbd0eSchristos
703*c42dbd0eSchristos   Since the metrics can be tailored for different views, there is a way
704*c42dbd0eSchristosto reset them to the default.  This is done through the special keyword
705*c42dbd0eSchristos'default'.
706*c42dbd0eSchristos
707*c42dbd0eSchristos
708*c42dbd0eSchristosFile: gprofng.info,  Node: A First Customization of the Output,  Next: Name the Experiment Directory,  Prev: Display and Define the Metrics,  Up: Getting Started
709*c42dbd0eSchristos
710*c42dbd0eSchristos3.1.6 A First Customization of the Output
711*c42dbd0eSchristos-----------------------------------------
712*c42dbd0eSchristos
713*c42dbd0eSchristosWith the information just given, we can customize the function overview.
714*c42dbd0eSchristosFor sake of the example, we would like to display the name of the
715*c42dbd0eSchristosfunction first, followed by the exclusive CPU time, given as an absolute
716*c42dbd0eSchristosnumber and a percentage.
717*c42dbd0eSchristos
718*c42dbd0eSchristos   Note that the commands are parsed in order of appearance.  This is
719*c42dbd0eSchristoswhy we need to define the metrics _before_ requesting the function
720*c42dbd0eSchristosoverview:
721*c42dbd0eSchristos
722*c42dbd0eSchristos     $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er
723*c42dbd0eSchristos
724*c42dbd0eSchristos     Current metrics: name:e.%totalcpu
725*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
726*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
727*c42dbd0eSchristos
728*c42dbd0eSchristos     Name                Excl. Total
729*c42dbd0eSchristos                         CPU
730*c42dbd0eSchristos                          sec.      %
731*c42dbd0eSchristos      <Total>            2.272 100.00
732*c42dbd0eSchristos      mxv_core           2.160  95.04
733*c42dbd0eSchristos      init_data          0.047   2.06
734*c42dbd0eSchristos      erand48_r          0.030   1.32
735*c42dbd0eSchristos      __drand48_iterate  0.013   0.57
736*c42dbd0eSchristos      drand48            0.013   0.57
737*c42dbd0eSchristos      _int_malloc        0.008   0.35
738*c42dbd0eSchristos      brk                0.001   0.04
739*c42dbd0eSchristos      sysmalloc          0.001   0.04
740*c42dbd0eSchristos      __default_morecore 0.      0.
741*c42dbd0eSchristos      __libc_start_main  0.      0.
742*c42dbd0eSchristos      allocate_data      0.      0.
743*c42dbd0eSchristos      collector_root     0.      0.
744*c42dbd0eSchristos      driver_mxv         0.      0.
745*c42dbd0eSchristos      main               0.      0.
746*c42dbd0eSchristos      malloc             0.      0.
747*c42dbd0eSchristos      sbrk               0.      0.
748*c42dbd0eSchristos
749*c42dbd0eSchristos   This was a first and simple example how to customize the output.
750*c42dbd0eSchristosNote that we did not rerun our profiling job and merely modified the
751*c42dbd0eSchristosdisplay settings.  Below we will show other and also more advanced
752*c42dbd0eSchristosexamples of customization.
753*c42dbd0eSchristos
754*c42dbd0eSchristos
755*c42dbd0eSchristosFile: gprofng.info,  Node: Name the Experiment Directory,  Next: Control the Number of Lines in the Output,  Prev: A First Customization of the Output,  Up: Getting Started
756*c42dbd0eSchristos
757*c42dbd0eSchristos3.1.7 Name the Experiment Directory
758*c42dbd0eSchristos-----------------------------------
759*c42dbd0eSchristos
760*c42dbd0eSchristosWhen using 'gprofng collect app', the default names for experiments work
761*c42dbd0eSchristosfine, but they are quite generic.  It is often more convenient to select
762*c42dbd0eSchristosa more descriptive name.  For example, one that reflects conditions for
763*c42dbd0eSchristosthe experiment conducted.
764*c42dbd0eSchristos
765*c42dbd0eSchristos   For this, the mutually exclusive '-o' and '-O' options come in handy.
766*c42dbd0eSchristosBoth may be used to provide a name for the experiment directory, but the
767*c42dbd0eSchristosbehaviour of 'gprofng collect app' is different.
768*c42dbd0eSchristos
769*c42dbd0eSchristos   With the '-o' option, an existing experiment directory is not
770*c42dbd0eSchristosoverwritten.  You either need to explicitly remove an existing directory
771*c42dbd0eSchristosfirst, or use a name that is not in use yet.
772*c42dbd0eSchristos
773*c42dbd0eSchristos   This is in contrast with the behaviour for the '-O' option.  Any
774*c42dbd0eSchristosexisting (experiment) directory with the same name is silently
775*c42dbd0eSchristosoverwritten.
776*c42dbd0eSchristos
777*c42dbd0eSchristos   Be aware that the name of the experiment directory has to end with
778*c42dbd0eSchristos'.er'.
779*c42dbd0eSchristos
780*c42dbd0eSchristos
781*c42dbd0eSchristosFile: gprofng.info,  Node: Control the Number of Lines in the Output,  Next: Sorting the Performance Data,  Prev: Name the Experiment Directory,  Up: Getting Started
782*c42dbd0eSchristos
783*c42dbd0eSchristos3.1.8 Control the Number of Lines in the Output
784*c42dbd0eSchristos-----------------------------------------------
785*c42dbd0eSchristos
786*c42dbd0eSchristosThe 'limit <n>' command can be used to control the number of lines
787*c42dbd0eSchristosprinted in various overviews, including the function view, but it also
788*c42dbd0eSchristostakes effect for other display commands, like 'lines'.
789*c42dbd0eSchristos
790*c42dbd0eSchristos   The argument '<n>' should be a positive integer number.  It sets the
791*c42dbd0eSchristosnumber of lines in the function view.  A value of zero resets the limit
792*c42dbd0eSchristosto the default.
793*c42dbd0eSchristos
794*c42dbd0eSchristos   Be aware that the pseudo-function '<Total>' counts as a regular
795*c42dbd0eSchristosfunction.  For example 'limit 10' displays nine user level functions.
796*c42dbd0eSchristos
797*c42dbd0eSchristos
798*c42dbd0eSchristosFile: gprofng.info,  Node: Sorting the Performance Data,  Next: Scripting,  Prev: Control the Number of Lines in the Output,  Up: Getting Started
799*c42dbd0eSchristos
800*c42dbd0eSchristos3.1.9 Sorting the Performance Data
801*c42dbd0eSchristos----------------------------------
802*c42dbd0eSchristos
803*c42dbd0eSchristosThe 'sort <key>' command sets the key to be used when sorting the
804*c42dbd0eSchristosperformance data.
805*c42dbd0eSchristos
806*c42dbd0eSchristos   The key is a valid metric definition, but the visibility field (*Note
807*c42dbd0eSchristosMetric Definitions::) in the metric definition is ignored since this
808*c42dbd0eSchristosdoes not affect the outcome of the sorting operation.  For example if we
809*c42dbd0eSchristosset the sort key to 'e.totalcpu', the values will be sorted in
810*c42dbd0eSchristosdescending order with respect to the exclusive total CPU time.
811*c42dbd0eSchristos
812*c42dbd0eSchristos   The data can be sorted in reverse order by prepending the metric
813*c42dbd0eSchristosdefinition with a minus ('-') sign.  For example 'sort -e.totalcpu'.
814*c42dbd0eSchristos
815*c42dbd0eSchristos   A default metric for the sort operation has been defined and since
816*c42dbd0eSchristosthis is a persistent command, this default can be restored with
817*c42dbd0eSchristos'default' as the key.
818*c42dbd0eSchristos
819*c42dbd0eSchristos
820*c42dbd0eSchristosFile: gprofng.info,  Node: Scripting,  Next: A More Elaborate Example,  Prev: Sorting the Performance Data,  Up: Getting Started
821*c42dbd0eSchristos
822*c42dbd0eSchristos3.1.10 Scripting
823*c42dbd0eSchristos----------------
824*c42dbd0eSchristos
825*c42dbd0eSchristosAs is probably clear by now, the list with commands for 'gprofng display
826*c42dbd0eSchristostext' can be very long.  This is tedious and also error prone.  Luckily,
827*c42dbd0eSchristosthere is an easier and more elegant way to control the behaviour of this
828*c42dbd0eSchristostool.
829*c42dbd0eSchristos
830*c42dbd0eSchristos   Through the 'script' command, the name of a file with commands can be
831*c42dbd0eSchristospassed in.  These commands are parsed and executed as if they appeared
832*c42dbd0eSchristoson the command line in the same order as encountered in the file.  The
833*c42dbd0eSchristoscommands in this script file can actually be mixed with commands on the
834*c42dbd0eSchristoscommand line.
835*c42dbd0eSchristos
836*c42dbd0eSchristos   The difference between the commands in the script file and those used
837*c42dbd0eSchristoson the command line is that the latter require a leading dash ('-')
838*c42dbd0eSchristossymbol.
839*c42dbd0eSchristos
840*c42dbd0eSchristos   Comment lines are supported.  They need to start with the '#' symbol.
841*c42dbd0eSchristos
842*c42dbd0eSchristos
843*c42dbd0eSchristosFile: gprofng.info,  Node: A More Elaborate Example,  Next: The Call Tree,  Prev: Scripting,  Up: Getting Started
844*c42dbd0eSchristos
845*c42dbd0eSchristos3.1.11 A More Elaborate Example
846*c42dbd0eSchristos-------------------------------
847*c42dbd0eSchristos
848*c42dbd0eSchristosWith the information presented so far, we can customize our data
849*c42dbd0eSchristosgathering and display commands.
850*c42dbd0eSchristos
851*c42dbd0eSchristos   As an example, to reflect the name of the algorithm and the number of
852*c42dbd0eSchristosthreads that were used in the experiment, we select 'mxv.1.thr.er' as
853*c42dbd0eSchristosthe name of the experiment directory.  All we then need to do is to add
854*c42dbd0eSchristosthe '-O' option followed by this name on the command line when running
855*c42dbd0eSchristos'gprofng collect app':
856*c42dbd0eSchristos
857*c42dbd0eSchristos     $ exe=mxv-pthreads.exe
858*c42dbd0eSchristos     $ m=3000
859*c42dbd0eSchristos     $ n=2000
860*c42dbd0eSchristos     $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1
861*c42dbd0eSchristos
862*c42dbd0eSchristos   The commands to generate the profile are put into a file that we
863*c42dbd0eSchristossimply call 'my-script':
864*c42dbd0eSchristos
865*c42dbd0eSchristos     $ cat my-script
866*c42dbd0eSchristos     # This is my first gprofng script
867*c42dbd0eSchristos     # Set the metrics
868*c42dbd0eSchristos     metrics i.%totalcpu:e.%totalcpu:name
869*c42dbd0eSchristos     # Use the exclusive time to sort
870*c42dbd0eSchristos     sort e.totalcpu
871*c42dbd0eSchristos     # Limit the function list to 5 lines
872*c42dbd0eSchristos     limit 5
873*c42dbd0eSchristos     # Show the function list
874*c42dbd0eSchristos     functions
875*c42dbd0eSchristos
876*c42dbd0eSchristos   This script file is then specified as input to the 'gprofng display
877*c42dbd0eSchristostext' command that is used to display the performance information stored
878*c42dbd0eSchristosin 'mxv.1.thr.er':
879*c42dbd0eSchristos
880*c42dbd0eSchristos     $ gprofng display text -script my-script mxv.1.thr.er
881*c42dbd0eSchristos
882*c42dbd0eSchristos   The command above produces the following output:
883*c42dbd0eSchristos
884*c42dbd0eSchristos     # This is my first gprofng script
885*c42dbd0eSchristos     # Set the metrics
886*c42dbd0eSchristos     Current metrics: i.%totalcpu:e.%totalcpu:name
887*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
888*c42dbd0eSchristos     # Use the exclusive time to sort
889*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
890*c42dbd0eSchristos     # Limit the function list to 5 lines
891*c42dbd0eSchristos     Print limit set to 5
892*c42dbd0eSchristos     # Show the function list
893*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
894*c42dbd0eSchristos
895*c42dbd0eSchristos     Incl. Total   Excl. Total    Name
896*c42dbd0eSchristos     CPU           CPU
897*c42dbd0eSchristos      sec.      %   sec.      %
898*c42dbd0eSchristos     2.272 100.00  2.272 100.00   <Total>
899*c42dbd0eSchristos     2.159  95.00  2.159  95.00   mxv_core
900*c42dbd0eSchristos     0.102   4.48  0.054   2.37   init_data
901*c42dbd0eSchristos     0.035   1.54  0.025   1.10   erand48_r
902*c42dbd0eSchristos     0.048   2.11  0.013   0.57   drand48
903*c42dbd0eSchristos
904*c42dbd0eSchristos   In the first part of the output, our comment lines in the script file
905*c42dbd0eSchristosare shown.  These are interleaved with an acknowledgement message for
906*c42dbd0eSchristosthe commands.
907*c42dbd0eSchristos
908*c42dbd0eSchristos   This is followed by a profile consisting of 5 lines only.  For both
909*c42dbd0eSchristosmetrics, the percentages plus the timings are given.  The numbers are
910*c42dbd0eSchristossorted with respect to the exclusive total CPU time.
911*c42dbd0eSchristos
912*c42dbd0eSchristos   It is now immediately clear that function 'mxv_core' is responsbile
913*c42dbd0eSchristosfor 95% of the CPU time and 'init_data' takes 4.5% only.
914*c42dbd0eSchristos
915*c42dbd0eSchristos   This is also where we see sampling in action.  Although this is
916*c42dbd0eSchristosexactly the same job we profiled before, the timings are somewhat
917*c42dbd0eSchristosdifferent, but the differences are very small.
918*c42dbd0eSchristos
919*c42dbd0eSchristos
920*c42dbd0eSchristosFile: gprofng.info,  Node: The Call Tree,  Next: More Information on the Experiment,  Prev: A More Elaborate Example,  Up: Getting Started
921*c42dbd0eSchristos
922*c42dbd0eSchristos3.1.12 The Call Tree
923*c42dbd0eSchristos--------------------
924*c42dbd0eSchristos
925*c42dbd0eSchristosThe call tree shows the dynamic hierarchy of the application by
926*c42dbd0eSchristosdisplaying the functions executed and their parent.  It helps to find
927*c42dbd0eSchristosthe most expensive path in the program.
928*c42dbd0eSchristos
929*c42dbd0eSchristos   This feature is enabled through the 'calltree' command.  This is how
930*c42dbd0eSchristosto get this tree for our current experiment:
931*c42dbd0eSchristos
932*c42dbd0eSchristos     $ gprofng display text -calltree mxv.1.thr.er
933*c42dbd0eSchristos
934*c42dbd0eSchristos   This displays the following structure:
935*c42dbd0eSchristos
936*c42dbd0eSchristos     Functions Call Tree. Metric: Attributed Total CPU Time
937*c42dbd0eSchristos
938*c42dbd0eSchristos     Attr.      Name
939*c42dbd0eSchristos     Total
940*c42dbd0eSchristos     CPU sec.
941*c42dbd0eSchristos     2.272      +-<Total>
942*c42dbd0eSchristos     2.159        +-collector_root
943*c42dbd0eSchristos     2.159        |  +-driver_mxv
944*c42dbd0eSchristos     2.159        |    +-mxv_core
945*c42dbd0eSchristos     0.114        +-__libc_start_main
946*c42dbd0eSchristos     0.114          +-main
947*c42dbd0eSchristos     0.102            +-init_data
948*c42dbd0eSchristos     0.048            |  +-drand48
949*c42dbd0eSchristos     0.035            |    +-erand48_r
950*c42dbd0eSchristos     0.010            |      +-__drand48_iterate
951*c42dbd0eSchristos     0.011            +-allocate_data
952*c42dbd0eSchristos     0.011            |  +-malloc
953*c42dbd0eSchristos     0.011            |    +-_int_malloc
954*c42dbd0eSchristos     0.001            |      +-sysmalloc
955*c42dbd0eSchristos     0.001            +-check_results
956*c42dbd0eSchristos     0.001              +-malloc
957*c42dbd0eSchristos     0.001                +-_int_malloc
958*c42dbd0eSchristos
959*c42dbd0eSchristos   At first sight this may not be what you expected and some explanation
960*c42dbd0eSchristosis in place.
961*c42dbd0eSchristos
962*c42dbd0eSchristos   First of all, function 'collector_root' is internal to 'gprofng' and
963*c42dbd0eSchristosshould be hidden to the user.  This is part of a planned future
964*c42dbd0eSchristosenhancement.
965*c42dbd0eSchristos
966*c42dbd0eSchristos   Recall that the 'objects' and 'fsingle' commands are very useful to
967*c42dbd0eSchristosfind out more about load objects in general, but also to help identify
968*c42dbd0eSchristosan unknown entry in the function overview.  *Note Load Objects and
969*c42dbd0eSchristosFunctions::.
970*c42dbd0eSchristos
971*c42dbd0eSchristos   Another thing to note is that there are two main branches.  The one
972*c42dbd0eSchristosunder 'collector_root' and the second one under '__libc_start_main'.
973*c42dbd0eSchristosThis reflects the fact that we are executing a parallel program.  Even
974*c42dbd0eSchristosthough we only used one thread for this run, this is still executed in a
975*c42dbd0eSchristosseparate path.
976*c42dbd0eSchristos
977*c42dbd0eSchristos   The main, sequential part of the program is displayed under 'main'
978*c42dbd0eSchristosand shows the functions called and the time they took.
979*c42dbd0eSchristos
980*c42dbd0eSchristos   There are two things worth noting for the call tree feature:
981*c42dbd0eSchristos
982*c42dbd0eSchristos   * This is a dynamic tree and since sampling is used, it most likely
983*c42dbd0eSchristos     looks slighlty different across seemingly identical profile runs.
984*c42dbd0eSchristos     In case the run times are short, it is worth considering to use a
985*c42dbd0eSchristos     high resolution through the '-p' option.  For example to use '-p
986*c42dbd0eSchristos     hi' to increase the sampling rate.
987*c42dbd0eSchristos
988*c42dbd0eSchristos   * In case hardware event counters have been enabled (*Note Profile
989*c42dbd0eSchristos     Hardware Event Counters::), these values are also displayed in the
990*c42dbd0eSchristos     call tree view.
991*c42dbd0eSchristos
992*c42dbd0eSchristos
993*c42dbd0eSchristosFile: gprofng.info,  Node: More Information on the Experiment,  Next: Control the Sampling Frequency,  Prev: The Call Tree,  Up: Getting Started
994*c42dbd0eSchristos
995*c42dbd0eSchristos3.1.13 More Information on the Experiment
996*c42dbd0eSchristos-----------------------------------------
997*c42dbd0eSchristos
998*c42dbd0eSchristosThe experiment directory not only contains performance related data.
999*c42dbd0eSchristosSeveral system characteristics, the actually command executed, and some
1000*c42dbd0eSchristosglobal performance statistics can be displayed.
1001*c42dbd0eSchristos
1002*c42dbd0eSchristos   The 'header' command displays information about the experiment(s).
1003*c42dbd0eSchristosFor example, this is the command to extract this data from for our
1004*c42dbd0eSchristosexperiment directory:
1005*c42dbd0eSchristos
1006*c42dbd0eSchristos     $ gprofng display text -header mxv.1.thr.er
1007*c42dbd0eSchristos
1008*c42dbd0eSchristos   The above command prints the following information.  Note that some
1009*c42dbd0eSchristosof the lay-out and the information has been modified.  The textual
1010*c42dbd0eSchristoschanges are marked with the '<' and '>' symbols.
1011*c42dbd0eSchristos
1012*c42dbd0eSchristos     Experiment: mxv.1.thr.er
1013*c42dbd0eSchristos     No errors
1014*c42dbd0eSchristos     No warnings
1015*c42dbd0eSchristos     Archive command `gp-archive -n -a on
1016*c42dbd0eSchristos              --outfile <exp_dir>/archive.log <exp_dir>'
1017*c42dbd0eSchristos
1018*c42dbd0eSchristos     Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
1019*c42dbd0eSchristos     Process pid 30591, ppid 30589, pgrp 30551, sid 30468
1020*c42dbd0eSchristos     Current working directory: <cwd>
1021*c42dbd0eSchristos     Collector version: `2.36.50'; experiment version 12.4 (64-bit)
1022*c42dbd0eSchristos     Host `<hostname>', OS `Linux <version>', page size 4096,
1023*c42dbd0eSchristos          architecture `x86_64'
1024*c42dbd0eSchristos       16 CPUs, clock speed 1995 MHz.
1025*c42dbd0eSchristos       Memory: 30871514 pages @  4096 = 120591 MB.
1026*c42dbd0eSchristos     Data collection parameters:
1027*c42dbd0eSchristos       Clock-profiling, interval = 997 microsecs.
1028*c42dbd0eSchristos       Periodic sampling, 1 secs.
1029*c42dbd0eSchristos       Follow descendant processes from: fork|exec|combo
1030*c42dbd0eSchristos
1031*c42dbd0eSchristos     Experiment started <date and time>
1032*c42dbd0eSchristos
1033*c42dbd0eSchristos     Experiment Ended: 2.293162658
1034*c42dbd0eSchristos     Data Collection Duration: 2.293162658
1035*c42dbd0eSchristos
1036*c42dbd0eSchristos   The output above may assist in troubleshooting, or to verify some of
1037*c42dbd0eSchristosthe operational conditions and we recommand to include this command when
1038*c42dbd0eSchristosgenerating a profile.
1039*c42dbd0eSchristos
1040*c42dbd0eSchristos   Related to this command there is a useful option to record your own
1041*c42dbd0eSchristoscomment(s) in an experiment.  To this end, use the '-C' option on the
1042*c42dbd0eSchristos'gprofng collect app' tool to specify a comment string.  Up to ten
1043*c42dbd0eSchristoscomment lines can be included.  These comments are displayed with the
1044*c42dbd0eSchristos'header' command on the 'gprofng display text' tool.
1045*c42dbd0eSchristos
1046*c42dbd0eSchristos   The 'overview' command displays information on the experiment(s) and
1047*c42dbd0eSchristosalso shows a summary of the values for the metric(s) used.  This is an
1048*c42dbd0eSchristosexample how to use it on our newly created experiment directory:
1049*c42dbd0eSchristos
1050*c42dbd0eSchristos     $ gprofng display text -overview mxv.1.thr.er
1051*c42dbd0eSchristos
1052*c42dbd0eSchristos     Experiment(s):
1053*c42dbd0eSchristos
1054*c42dbd0eSchristos     Experiment      :mxv.1.thr.er
1055*c42dbd0eSchristos       Target        : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
1056*c42dbd0eSchristos       Host          : <hostname> (<ISA>, Linux <version>)
1057*c42dbd0eSchristos       Start Time    : <date and time>
1058*c42dbd0eSchristos       Duration      : 2.293 Seconds
1059*c42dbd0eSchristos
1060*c42dbd0eSchristos     Metrics:
1061*c42dbd0eSchristos
1062*c42dbd0eSchristos       Experiment Duration (Seconds): [2.293]
1063*c42dbd0eSchristos       Clock Profiling
1064*c42dbd0eSchristos         [X]Total CPU Time - totalcpu (Seconds): [*2.272]
1065*c42dbd0eSchristos
1066*c42dbd0eSchristos     Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
1067*c42dbd0eSchristos            metrics.
1068*c42dbd0eSchristos            The metrics command can be used to change selections. The
1069*c42dbd0eSchristos            metric_list command lists all available metrics.
1070*c42dbd0eSchristos
1071*c42dbd0eSchristos   This command provides a dashboard overview that helps to easily
1072*c42dbd0eSchristosidentify where the time is spent and in case hardware event counters are
1073*c42dbd0eSchristosused, it shows their total values.
1074*c42dbd0eSchristos
1075*c42dbd0eSchristos
1076*c42dbd0eSchristosFile: gprofng.info,  Node: Control the Sampling Frequency,  Next: Information on Load Objects,  Prev: More Information on the Experiment,  Up: Getting Started
1077*c42dbd0eSchristos
1078*c42dbd0eSchristos3.1.14 Control the Sampling Frequency
1079*c42dbd0eSchristos-------------------------------------
1080*c42dbd0eSchristos
1081*c42dbd0eSchristosSo far we did not talk about the frequency of the sampling process, but
1082*c42dbd0eSchristosin some cases it is useful to change the default of 10 milliseconds.
1083*c42dbd0eSchristos
1084*c42dbd0eSchristos   The advantage of increasing the sampling frequency is that functions
1085*c42dbd0eSchristosthat do not take much time per invocation are more accurately captured.
1086*c42dbd0eSchristosThe downside is that more data is gathered.  This has an impact on the
1087*c42dbd0eSchristosoverhead of the collection process and more disk space is required.
1088*c42dbd0eSchristos
1089*c42dbd0eSchristos   In general this is not an immediate concern, but with heavily
1090*c42dbd0eSchristosthreaded applications that run for an extended period of time,
1091*c42dbd0eSchristosincreasing the frequency may have a more noticeable impact.
1092*c42dbd0eSchristos
1093*c42dbd0eSchristos   The '-p' option on the 'gprofng collect app' tool is used to enable
1094*c42dbd0eSchristosor disable clock based profiling, or to explicitly set the sampling
1095*c42dbd0eSchristosrate.  This option takes one of the following keywords:
1096*c42dbd0eSchristos
1097*c42dbd0eSchristos'off'
1098*c42dbd0eSchristos     Disable clock based profiling.
1099*c42dbd0eSchristos
1100*c42dbd0eSchristos'on'
1101*c42dbd0eSchristos     Enable clock based profiling with a per thread sampling interval of
1102*c42dbd0eSchristos     10 ms.  This is the default.
1103*c42dbd0eSchristos
1104*c42dbd0eSchristos'lo'
1105*c42dbd0eSchristos     Enable clock based profiling with a per thread sampling interval of
1106*c42dbd0eSchristos     100 ms.
1107*c42dbd0eSchristos
1108*c42dbd0eSchristos'hi'
1109*c42dbd0eSchristos     Enable clock based profiling with a per thread sampling interval of
1110*c42dbd0eSchristos     1 ms.
1111*c42dbd0eSchristos
1112*c42dbd0eSchristos'<value>'
1113*c42dbd0eSchristos     Enable clock based profiling with a per thread sampling interval of
1114*c42dbd0eSchristos     <value>.
1115*c42dbd0eSchristos
1116*c42dbd0eSchristos   One may wonder why there is an option to disable clock based
1117*c42dbd0eSchristosprofiling.  This is because by default, it is enabled when conducting
1118*c42dbd0eSchristoshardware event counter experiments (*Note Profile Hardware Event
1119*c42dbd0eSchristosCounters::).  With the '-p off' option, this can be disabled.
1120*c42dbd0eSchristos
1121*c42dbd0eSchristos   If an explicit value is set for the sampling, the number can be an
1122*c42dbd0eSchristosinteger or a floating-point number.  A suffix of 'u' for microseconds,
1123*c42dbd0eSchristosor 'm' for milliseconds is supported.  If no suffix is used, the value
1124*c42dbd0eSchristosis assumed to be in milliseconds.
1125*c42dbd0eSchristos
1126*c42dbd0eSchristos   If the value is smaller than the clock profiling minimum, a warning
1127*c42dbd0eSchristosmessage is issued and it is set to the minimum.  In case it is not a
1128*c42dbd0eSchristosmultiple of the clock profiling resolution, it is silently rounded down
1129*c42dbd0eSchristosto the nearest multiple of the clock resolution.
1130*c42dbd0eSchristos
1131*c42dbd0eSchristos   If the value exceeds the clock profiling maximum, is negative, or
1132*c42dbd0eSchristoszero, an error is reported.
1133*c42dbd0eSchristos
1134*c42dbd0eSchristos   Note that the 'header' command echoes the sampling rate used.
1135*c42dbd0eSchristos
1136*c42dbd0eSchristos
1137*c42dbd0eSchristosFile: gprofng.info,  Node: Information on Load Objects,  Prev: Control the Sampling Frequency,  Up: Getting Started
1138*c42dbd0eSchristos
1139*c42dbd0eSchristos3.1.15 Information on Load Objects
1140*c42dbd0eSchristos----------------------------------
1141*c42dbd0eSchristos
1142*c42dbd0eSchristosIt may happen that the function list contains a function that is not
1143*c42dbd0eSchristosknown to the user.  This can easily happen with library functions for
1144*c42dbd0eSchristosexample.  Luckily there are three commands that come in handy then.
1145*c42dbd0eSchristos
1146*c42dbd0eSchristos   These commands are 'objects', 'fsingle', and 'fsummary'.  They
1147*c42dbd0eSchristosprovide details on load objects (*Note Load Objects and Functions::).
1148*c42dbd0eSchristos
1149*c42dbd0eSchristos   The 'objects' command lists all load objects that have been
1150*c42dbd0eSchristosreferenced during the performance experiment.  Below we show the command
1151*c42dbd0eSchristosand the result for our profile job.  Like before, the (long) path names
1152*c42dbd0eSchristosin the output have been shortened and replaced by the '<apath>' symbol
1153*c42dbd0eSchristosthat represents an absolute directory path.
1154*c42dbd0eSchristos
1155*c42dbd0eSchristos     $ gprofng display text -objects mxv.1.thr.er
1156*c42dbd0eSchristos
1157*c42dbd0eSchristos   The output includes the name and path of the target executable:
1158*c42dbd0eSchristos
1159*c42dbd0eSchristos      <Unknown> (<Unknown>)
1160*c42dbd0eSchristos      <mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
1161*c42dbd0eSchristos      <librt-2.17.so> (/usr/lib64/librt-2.17.so)
1162*c42dbd0eSchristos      <libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
1163*c42dbd0eSchristos      <libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
1164*c42dbd0eSchristos      <libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
1165*c42dbd0eSchristos      <libc-2.17.so> (/usr/lib64/libc-2.17.so)
1166*c42dbd0eSchristos      <libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
1167*c42dbd0eSchristos      <libm-2.17.so> (/usr/lib64/libm-2.17.so)
1168*c42dbd0eSchristos      <libgp-collector.so> (<apath>/libgp-collector.so)
1169*c42dbd0eSchristos      <ld-2.17.so> (/usr/lib64/ld-2.17.so)
1170*c42dbd0eSchristos      <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)
1171*c42dbd0eSchristos
1172*c42dbd0eSchristos   The 'fsingle' command may be used to get more details on a specific
1173*c42dbd0eSchristosentry in the function view, say.  For example, the command below
1174*c42dbd0eSchristosprovides additional information on the 'collector_root' function shown
1175*c42dbd0eSchristosin the function overview.
1176*c42dbd0eSchristos
1177*c42dbd0eSchristos     $ gprofng display text -fsingle collector_root mxv.1.thr.er
1178*c42dbd0eSchristos
1179*c42dbd0eSchristos   Below the output from this command.  It has been somewhat modified to
1180*c42dbd0eSchristosmatch the display requirements.
1181*c42dbd0eSchristos
1182*c42dbd0eSchristos     collector_root
1183*c42dbd0eSchristos       Exclusive Total CPU Time: 0.    (  0. %)
1184*c42dbd0eSchristos       Inclusive Total CPU Time: 2.159 ( 95.0%)
1185*c42dbd0eSchristos                 Size:   401
1186*c42dbd0eSchristos           PC Address: 10:0x0001db60
1187*c42dbd0eSchristos          Source File: <apath>/dispatcher.c
1188*c42dbd0eSchristos          Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
1189*c42dbd0eSchristos          Load Object: <apath>/libgp-collector.so
1190*c42dbd0eSchristos         Mangled Name:
1191*c42dbd0eSchristos              Aliases:
1192*c42dbd0eSchristos
1193*c42dbd0eSchristos   In this table we not only see how much time was spent in this
1194*c42dbd0eSchristosfunction, we also see where it originates from.  In addition to this,
1195*c42dbd0eSchristosthe size and start address are given as well.  If the source code
1196*c42dbd0eSchristoslocation is known it is also shown here.
1197*c42dbd0eSchristos
1198*c42dbd0eSchristos   The related 'fsummary' command displays the same information as
1199*c42dbd0eSchristos'fsingle', but for all functions in the function overview, including
1200*c42dbd0eSchristos'<Total>':
1201*c42dbd0eSchristos
1202*c42dbd0eSchristos     $ gprofng display text -fsummary mxv.1.thr.er
1203*c42dbd0eSchristos
1204*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1205*c42dbd0eSchristos
1206*c42dbd0eSchristos     <Total>
1207*c42dbd0eSchristos       Exclusive Total CPU Time: 2.272 (100.0%)
1208*c42dbd0eSchristos       Inclusive Total CPU Time: 2.272 (100.0%)
1209*c42dbd0eSchristos                 Size:     0
1210*c42dbd0eSchristos           PC Address: 1:0x00000000
1211*c42dbd0eSchristos          Source File: (unknown)
1212*c42dbd0eSchristos          Object File: (unknown)
1213*c42dbd0eSchristos          Load Object: <Total>
1214*c42dbd0eSchristos         Mangled Name:
1215*c42dbd0eSchristos              Aliases:
1216*c42dbd0eSchristos
1217*c42dbd0eSchristos     mxv_core
1218*c42dbd0eSchristos       Exclusive Total CPU Time: 2.159 ( 95.0%)
1219*c42dbd0eSchristos       Inclusive Total CPU Time: 2.159 ( 95.0%)
1220*c42dbd0eSchristos                 Size:    75
1221*c42dbd0eSchristos           PC Address: 2:0x000021ba
1222*c42dbd0eSchristos          Source File: <apath>/mxv.c
1223*c42dbd0eSchristos          Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
1224*c42dbd0eSchristos          Load Object: <apath>/mxv-pthreads.exe
1225*c42dbd0eSchristos         Mangled Name:
1226*c42dbd0eSchristos              Aliases:
1227*c42dbd0eSchristos
1228*c42dbd0eSchristos               ... etc ...
1229*c42dbd0eSchristos
1230*c42dbd0eSchristos
1231*c42dbd0eSchristosFile: gprofng.info,  Node: Support for Multithreading,  Next: Viewing Multiple Experiments,  Prev: Getting Started,  Up: A Mini Tutorial
1232*c42dbd0eSchristos
1233*c42dbd0eSchristos3.2 Support for Multithreading
1234*c42dbd0eSchristos==============================
1235*c42dbd0eSchristos
1236*c42dbd0eSchristosIn this chapter we introduce and discuss the support for multithreading.
1237*c42dbd0eSchristosAs is shown below, nothing needs to be changed when collecting the
1238*c42dbd0eSchristosperformance data.
1239*c42dbd0eSchristos
1240*c42dbd0eSchristos   The difference is that additional commands are available to get more
1241*c42dbd0eSchristosinformation on the parallel environment, plus that several filters allow
1242*c42dbd0eSchristosthe user to zoom in on specific threads.
1243*c42dbd0eSchristos
1244*c42dbd0eSchristos* Menu:
1245*c42dbd0eSchristos
1246*c42dbd0eSchristos* Creating a Multithreading Experiment::
1247*c42dbd0eSchristos* Commands Specific to Multithreading::
1248*c42dbd0eSchristos
1249*c42dbd0eSchristos
1250*c42dbd0eSchristosFile: gprofng.info,  Node: Creating a Multithreading Experiment,  Next: Commands Specific to Multithreading,  Up: Support for Multithreading
1251*c42dbd0eSchristos
1252*c42dbd0eSchristos3.2.1 Creating a Multithreading Experiment
1253*c42dbd0eSchristos------------------------------------------
1254*c42dbd0eSchristos
1255*c42dbd0eSchristosWe demonstrate the support for multithreading using the same code and
1256*c42dbd0eSchristossettings as before, but this time we use 2 threads:
1257*c42dbd0eSchristos
1258*c42dbd0eSchristos     $ exe=mxv-pthreads.exe
1259*c42dbd0eSchristos     $ m=3000
1260*c42dbd0eSchristos     $ n=2000
1261*c42dbd0eSchristos     $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2
1262*c42dbd0eSchristos
1263*c42dbd0eSchristos   First of all, note that we did not change anything, other than
1264*c42dbd0eSchristossetting the number of threads to 2.  Nothing special is needed to
1265*c42dbd0eSchristosprofile a multithreaded job when using 'gprofng'.
1266*c42dbd0eSchristos
1267*c42dbd0eSchristos   The same is true when displaying the performance results.  The same
1268*c42dbd0eSchristoscommands that we used before work unmodified.  For example, this is all
1269*c42dbd0eSchristosthat is needed to get a function overview:
1270*c42dbd0eSchristos
1271*c42dbd0eSchristos     $ gpprofng display text -limit 10 -functions mxv.2.thr.er
1272*c42dbd0eSchristos
1273*c42dbd0eSchristos   This produces the following familiar looking output:
1274*c42dbd0eSchristos
1275*c42dbd0eSchristos     Print limit set to 10
1276*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1277*c42dbd0eSchristos
1278*c42dbd0eSchristos     Excl.     Incl.      Name
1279*c42dbd0eSchristos     Total     Total
1280*c42dbd0eSchristos     CPU sec.  CPU sec.
1281*c42dbd0eSchristos     2.268     2.268      <Total>
1282*c42dbd0eSchristos     2.155     2.155      mxv_core
1283*c42dbd0eSchristos     0.044     0.103      init_data
1284*c42dbd0eSchristos     0.030     0.046      erand48_r
1285*c42dbd0eSchristos     0.016     0.016      __drand48_iterate
1286*c42dbd0eSchristos     0.013     0.059      drand48
1287*c42dbd0eSchristos     0.008     0.011      _int_malloc
1288*c42dbd0eSchristos     0.003     0.003      brk
1289*c42dbd0eSchristos     0.        0.003      __default_morecore
1290*c42dbd0eSchristos     0.        0.114      __libc_start_main
1291*c42dbd0eSchristos
1292*c42dbd0eSchristos
1293*c42dbd0eSchristosFile: gprofng.info,  Node: Commands Specific to Multithreading,  Prev: Creating a Multithreading Experiment,  Up: Support for Multithreading
1294*c42dbd0eSchristos
1295*c42dbd0eSchristos3.2.2 Commands Specific to Multithreading
1296*c42dbd0eSchristos-----------------------------------------
1297*c42dbd0eSchristos
1298*c42dbd0eSchristosThe function overview shown above shows the results aggregated over all
1299*c42dbd0eSchristosthe threads.  The interesting new element is that we can also look at
1300*c42dbd0eSchristosthe performance data for the individual threads.
1301*c42dbd0eSchristos
1302*c42dbd0eSchristos   The 'thread_list' command displays how many threads have been used:
1303*c42dbd0eSchristos
1304*c42dbd0eSchristos     $ gprofng display text -thread_list mxv.2.thr.er
1305*c42dbd0eSchristos
1306*c42dbd0eSchristos   This produces the following output, showing that three threads have
1307*c42dbd0eSchristosbeen used:
1308*c42dbd0eSchristos
1309*c42dbd0eSchristos     Exp Sel Total
1310*c42dbd0eSchristos     === === =====
1311*c42dbd0eSchristos       1 all     3
1312*c42dbd0eSchristos
1313*c42dbd0eSchristos   The output confirms there is one experiment and that by default all
1314*c42dbd0eSchristosthreads are selected.
1315*c42dbd0eSchristos
1316*c42dbd0eSchristos   It may seem surprising to see three threads here, since we used the
1317*c42dbd0eSchristos'-t 2' option, but it is common for a Pthreads program to use one
1318*c42dbd0eSchristosadditional thread.  This is typically the thread that runs from start to
1319*c42dbd0eSchristosfinish and handles the sequential portions of the code, as well as takes
1320*c42dbd0eSchristoscare of managing the threads.
1321*c42dbd0eSchristos
1322*c42dbd0eSchristos   It is no different in our example code.  At some point, the main
1323*c42dbd0eSchristosthread creates and activates the two threads that perform the
1324*c42dbd0eSchristosmultiplication of the matrix with the vector.  Upon completion of this
1325*c42dbd0eSchristoscomputation, the main thread continues.
1326*c42dbd0eSchristos
1327*c42dbd0eSchristos   The 'threads' command is simple, yet very powerful.  It shows the
1328*c42dbd0eSchristostotal value of the metrics for each thread.  To make it easier to
1329*c42dbd0eSchristosinterpret the data, we modify the metrics to include percentages:
1330*c42dbd0eSchristos
1331*c42dbd0eSchristos     $ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er
1332*c42dbd0eSchristos
1333*c42dbd0eSchristos   The command above produces the following overview:
1334*c42dbd0eSchristos
1335*c42dbd0eSchristos     Current metrics: e.%totalcpu:name
1336*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1337*c42dbd0eSchristos     Objects sorted by metric: Exclusive Total CPU Time
1338*c42dbd0eSchristos
1339*c42dbd0eSchristos     Excl. Total    Name
1340*c42dbd0eSchristos     CPU
1341*c42dbd0eSchristos      sec.      %
1342*c42dbd0eSchristos     2.258 100.00   <Total>
1343*c42dbd0eSchristos     1.075  47.59   Process 1, Thread 3
1344*c42dbd0eSchristos     1.070  47.37   Process 1, Thread 2
1345*c42dbd0eSchristos     0.114   5.03   Process 1, Thread 1
1346*c42dbd0eSchristos
1347*c42dbd0eSchristos   The first line gives the total CPU time accumulated over the threads
1348*c42dbd0eSchristosselected.  This is followed by the metric value(s) for each thread.
1349*c42dbd0eSchristos
1350*c42dbd0eSchristos   From this it is clear that the main thread is responsible for 5% of
1351*c42dbd0eSchristosthe total CPU time, while the other two threads take 47% each.
1352*c42dbd0eSchristos
1353*c42dbd0eSchristos   This view is ideally suited to verify if there any load balancing
1354*c42dbd0eSchristosissues and also to find the most time consuming thread(s).
1355*c42dbd0eSchristos
1356*c42dbd0eSchristos   While useful, often more information than this is needed.  This is
1357*c42dbd0eSchristoswhere the thread selection filter comes in.  Through the 'thread_select'
1358*c42dbd0eSchristoscommand, one or more threads may be selected (*Note The Selection List::
1359*c42dbd0eSchristoshow to define the selection list).
1360*c42dbd0eSchristos
1361*c42dbd0eSchristos   Since it is most common to use this command in a script, we do so as
1362*c42dbd0eSchristoswell here.  Below the script we are using:
1363*c42dbd0eSchristos
1364*c42dbd0eSchristos     # Define the metrics
1365*c42dbd0eSchristos     metrics e.%totalcpu
1366*c42dbd0eSchristos     # Limit the output to 10 lines
1367*c42dbd0eSchristos     limit 10
1368*c42dbd0eSchristos     # Get the function overview for thread 1
1369*c42dbd0eSchristos     thread_select 1
1370*c42dbd0eSchristos     functions
1371*c42dbd0eSchristos     # Get the function overview for thread 2
1372*c42dbd0eSchristos     thread_select 2
1373*c42dbd0eSchristos     functions
1374*c42dbd0eSchristos     # Get the function overview for thread 3
1375*c42dbd0eSchristos     thread_select 3
1376*c42dbd0eSchristos     functions
1377*c42dbd0eSchristos
1378*c42dbd0eSchristos   The definition of the metrics and the output limiter has been shown
1379*c42dbd0eSchristosand explained before and will be ignored.  The new command we focus on
1380*c42dbd0eSchristosis 'thread_select'.
1381*c42dbd0eSchristos
1382*c42dbd0eSchristos   This command takes a list (*Note The Selection List::) to select
1383*c42dbd0eSchristosspecific threads.  In this case we simply use the individual thread
1384*c42dbd0eSchristosnumbers that we obtained with the 'thread_list' command earlier.
1385*c42dbd0eSchristos
1386*c42dbd0eSchristos   This restricts the output of the 'functions' command to the thread
1387*c42dbd0eSchristosnumber(s) specified.  This means that the script above shows which
1388*c42dbd0eSchristosfunction(s) each thread executes and how much CPU time they consumed.
1389*c42dbd0eSchristosBoth the timings and their percentages are given.
1390*c42dbd0eSchristos
1391*c42dbd0eSchristos   This is the relevant part of the output for the first thread:
1392*c42dbd0eSchristos
1393*c42dbd0eSchristos     # Get the function overview for thread 1
1394*c42dbd0eSchristos     Exp Sel Total
1395*c42dbd0eSchristos     === === =====
1396*c42dbd0eSchristos       1 1       3
1397*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1398*c42dbd0eSchristos
1399*c42dbd0eSchristos     Excl. Total    Name
1400*c42dbd0eSchristos     CPU
1401*c42dbd0eSchristos      sec.      %
1402*c42dbd0eSchristos     0.114 100.00   <Total>
1403*c42dbd0eSchristos     0.051  44.74   init_data
1404*c42dbd0eSchristos     0.028  24.56   erand48_r
1405*c42dbd0eSchristos     0.017  14.91   __drand48_iterate
1406*c42dbd0eSchristos     0.010   8.77   _int_malloc
1407*c42dbd0eSchristos     0.008   7.02   drand48
1408*c42dbd0eSchristos     0.      0.     __libc_start_main
1409*c42dbd0eSchristos     0.      0.     allocate_data
1410*c42dbd0eSchristos     0.      0.     main
1411*c42dbd0eSchristos     0.      0.     malloc
1412*c42dbd0eSchristos
1413*c42dbd0eSchristos   As usual, the comment lines are echoed.  This is followed by a
1414*c42dbd0eSchristosconfirmation of our selection.  We see that indeed thread 1 has been
1415*c42dbd0eSchristosselected.  What is displayed next is the function overview for this
1416*c42dbd0eSchristosparticular thread.  Due to the 'limit 10' command, there are ten entries
1417*c42dbd0eSchristosin this list.
1418*c42dbd0eSchristos
1419*c42dbd0eSchristos   Below are the overviews for threads 2 and 3 respectively.  We see
1420*c42dbd0eSchristosthat all of the CPU time is spent in function 'mxv_core' and that this
1421*c42dbd0eSchristostime is approximately the same for both threads.
1422*c42dbd0eSchristos
1423*c42dbd0eSchristos     # Get the function overview for thread 2
1424*c42dbd0eSchristos     Exp Sel Total
1425*c42dbd0eSchristos     === === =====
1426*c42dbd0eSchristos       1 2       3
1427*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1428*c42dbd0eSchristos
1429*c42dbd0eSchristos     Excl. Total    Name
1430*c42dbd0eSchristos     CPU
1431*c42dbd0eSchristos      sec.      %
1432*c42dbd0eSchristos     1.072 100.00   <Total>
1433*c42dbd0eSchristos     1.072 100.00   mxv_core
1434*c42dbd0eSchristos     0.      0.     collector_root
1435*c42dbd0eSchristos     0.      0.     driver_mxv
1436*c42dbd0eSchristos
1437*c42dbd0eSchristos     # Get the function overview for thread 3
1438*c42dbd0eSchristos     Exp Sel Total
1439*c42dbd0eSchristos     === === =====
1440*c42dbd0eSchristos       1 3       3
1441*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1442*c42dbd0eSchristos
1443*c42dbd0eSchristos     Excl. Total    Name
1444*c42dbd0eSchristos     CPU
1445*c42dbd0eSchristos      sec.      %
1446*c42dbd0eSchristos     1.076 100.00   <Total>
1447*c42dbd0eSchristos     1.076 100.00   mxv_core
1448*c42dbd0eSchristos     0.      0.     collector_root
1449*c42dbd0eSchristos     0.      0.     driver_mxv
1450*c42dbd0eSchristos
1451*c42dbd0eSchristos   When analyzing the performance of a multithreaded application, it is
1452*c42dbd0eSchristossometimes useful to know whether threads have mostly executed on the
1453*c42dbd0eSchristossame core, say, or if they have wandered across multiple cores.  This
1454*c42dbd0eSchristossort of stickiness is usually referred to as _thread affinity_.
1455*c42dbd0eSchristos
1456*c42dbd0eSchristos   Similar to the commands for the threads, there are several commands
1457*c42dbd0eSchristosrelated to the usage of the cores, or _CPUs_ as they are called in
1458*c42dbd0eSchristos'gprofng' (*Note The Concept of a CPU in gprofng::).
1459*c42dbd0eSchristos
1460*c42dbd0eSchristos   In order to have some more interesting data to look at, we created a
1461*c42dbd0eSchristosnew experiment, this time using 8 threads:
1462*c42dbd0eSchristos
1463*c42dbd0eSchristos     $ exe=mxv-pthreads.exe
1464*c42dbd0eSchristos     $ m=3000
1465*c42dbd0eSchristos     $ n=2000
1466*c42dbd0eSchristos     $ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8
1467*c42dbd0eSchristos
1468*c42dbd0eSchristos   Similar to the 'thread_list' command, the 'cpu_list' command displays
1469*c42dbd0eSchristoshow many CPUs have been used.  The equivalent of the 'threads' threads
1470*c42dbd0eSchristoscommand, is the 'cpus' command, which shows the CPU numbers that were
1471*c42dbd0eSchristosused and how much time was spent on each of them.  Both are demonstrated
1472*c42dbd0eSchristosbelow.
1473*c42dbd0eSchristos
1474*c42dbd0eSchristos     $ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er
1475*c42dbd0eSchristos
1476*c42dbd0eSchristos   This command produces the following output:
1477*c42dbd0eSchristos
1478*c42dbd0eSchristos     Current metrics: e.%totalcpu:name
1479*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1480*c42dbd0eSchristos     Exp Sel Total
1481*c42dbd0eSchristos     === === =====
1482*c42dbd0eSchristos       1 all    10
1483*c42dbd0eSchristos     Objects sorted by metric: Exclusive Total CPU Time
1484*c42dbd0eSchristos
1485*c42dbd0eSchristos     Excl. Total    Name
1486*c42dbd0eSchristos     CPU
1487*c42dbd0eSchristos      sec.      %
1488*c42dbd0eSchristos     2.310 100.00   <Total>
1489*c42dbd0eSchristos     0.286  12.39   CPU 7
1490*c42dbd0eSchristos     0.284  12.30   CPU 13
1491*c42dbd0eSchristos     0.282  12.21   CPU 5
1492*c42dbd0eSchristos     0.280  12.13   CPU 14
1493*c42dbd0eSchristos     0.266  11.52   CPU 9
1494*c42dbd0eSchristos     0.265  11.48   CPU 2
1495*c42dbd0eSchristos     0.264  11.44   CPU 11
1496*c42dbd0eSchristos     0.194   8.42   CPU 0
1497*c42dbd0eSchristos     0.114   4.92   CPU 1
1498*c42dbd0eSchristos     0.074   3.19   CPU 15
1499*c42dbd0eSchristos
1500*c42dbd0eSchristos   What we see in this table is that a total of 10 CPUs have been used.
1501*c42dbd0eSchristosThis is followed by a list with all the CPU numbers that have been used
1502*c42dbd0eSchristosduring the run.  For each CPU it is shown how much time was spent on it.
1503*c42dbd0eSchristos
1504*c42dbd0eSchristos   While the table with thread times shown earlier may point at a load
1505*c42dbd0eSchristosimbalance in the application, this overview has a different purpose.
1506*c42dbd0eSchristos
1507*c42dbd0eSchristos   For example, we see that 10 CPUs have been used, but we know that the
1508*c42dbd0eSchristosapplication uses 9 threads only.  This means that at least one thread
1509*c42dbd0eSchristoshas executed on more than one CPU. In itself this is not something to
1510*c42dbd0eSchristosworry about, but warrants a deeper investigation.
1511*c42dbd0eSchristos
1512*c42dbd0eSchristos   Honesty dictates that next we performed a pre-analysis to find out
1513*c42dbd0eSchristoswhich thread(s) have been running on more than one CPU. We found this to
1514*c42dbd0eSchristosbe thread 7.  It has executed on CPUs 0 and 15.
1515*c42dbd0eSchristos
1516*c42dbd0eSchristos   With this knowledge, we wrote the script shown below.  It zooms in on
1517*c42dbd0eSchristosthe behaviour of thread 7.
1518*c42dbd0eSchristos
1519*c42dbd0eSchristos     # Define the metrics
1520*c42dbd0eSchristos     metrics e.%totalcpu
1521*c42dbd0eSchristos     # Limit the output to 10 lines
1522*c42dbd0eSchristos     limit 10
1523*c42dbd0eSchristos     functions
1524*c42dbd0eSchristos     # Get the function overview for CPU 0
1525*c42dbd0eSchristos     cpu_select 0
1526*c42dbd0eSchristos     functions
1527*c42dbd0eSchristos     # Get the function overview for CPU 15
1528*c42dbd0eSchristos     cpu_select 15
1529*c42dbd0eSchristos     functions
1530*c42dbd0eSchristos
1531*c42dbd0eSchristos   From the earlier shown threads overview, we know that thread 7 has
1532*c42dbd0eSchristosused '0.268' seconds of CPU time..
1533*c42dbd0eSchristos
1534*c42dbd0eSchristos   By selecting CPUs 0 and 15, respectively, we get the following
1535*c42dbd0eSchristosfunction overviews:
1536*c42dbd0eSchristos
1537*c42dbd0eSchristos     # Get the function overview for CPU 0
1538*c42dbd0eSchristos     Exp Sel Total
1539*c42dbd0eSchristos     === === =====
1540*c42dbd0eSchristos       1 0      10
1541*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1542*c42dbd0eSchristos
1543*c42dbd0eSchristos     Excl. Total    Name
1544*c42dbd0eSchristos     CPU
1545*c42dbd0eSchristos      sec.      %
1546*c42dbd0eSchristos     0.194 100.00   <Total>
1547*c42dbd0eSchristos     0.194 100.00   mxv_core
1548*c42dbd0eSchristos     0.      0.     collector_root
1549*c42dbd0eSchristos     0.      0.     driver_mxv
1550*c42dbd0eSchristos
1551*c42dbd0eSchristos     # Get the function overview for CPU 15
1552*c42dbd0eSchristos     Exp Sel Total
1553*c42dbd0eSchristos     === === =====
1554*c42dbd0eSchristos       1 15     10
1555*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1556*c42dbd0eSchristos
1557*c42dbd0eSchristos     Excl. Total    Name
1558*c42dbd0eSchristos     CPU
1559*c42dbd0eSchristos      sec.      %
1560*c42dbd0eSchristos     0.074 100.00   <Total>
1561*c42dbd0eSchristos     0.074 100.00   mxv_core
1562*c42dbd0eSchristos     0.      0.     collector_root
1563*c42dbd0eSchristos     0.      0.     driver_mxv
1564*c42dbd0eSchristos
1565*c42dbd0eSchristos   This shows that thread 7 spent '0.194' seconds on CPU 0 and '0.074'
1566*c42dbd0eSchristosseconds on CPU 15.
1567*c42dbd0eSchristos
1568*c42dbd0eSchristos
1569*c42dbd0eSchristosFile: gprofng.info,  Node: Viewing Multiple Experiments,  Next: Profile Hardware Event Counters,  Prev: Support for Multithreading,  Up: A Mini Tutorial
1570*c42dbd0eSchristos
1571*c42dbd0eSchristos3.3 Viewing Multiple Experiments
1572*c42dbd0eSchristos================================
1573*c42dbd0eSchristos
1574*c42dbd0eSchristosOne thing we did not cover sofar is that 'gprofng' fully supports the
1575*c42dbd0eSchristosanalysis of multiple experiments.  The 'gprofng display text' tool
1576*c42dbd0eSchristosaccepts a list of experiments.  The data can either be aggregated across
1577*c42dbd0eSchristosthe experiments, or used in a comparison.
1578*c42dbd0eSchristos
1579*c42dbd0eSchristos   Mention 'experiment_list'
1580*c42dbd0eSchristos
1581*c42dbd0eSchristos* Menu:
1582*c42dbd0eSchristos
1583*c42dbd0eSchristos* Aggregation of Experiments::
1584*c42dbd0eSchristos* Comparison of Experiments::
1585*c42dbd0eSchristos
1586*c42dbd0eSchristos
1587*c42dbd0eSchristosFile: gprofng.info,  Node: Aggregation of Experiments,  Next: Comparison of Experiments,  Up: Viewing Multiple Experiments
1588*c42dbd0eSchristos
1589*c42dbd0eSchristos3.3.1 Aggregation of Experiments
1590*c42dbd0eSchristos--------------------------------
1591*c42dbd0eSchristos
1592*c42dbd0eSchristosBy default, the data for multiple experiments is aggregrated and the
1593*c42dbd0eSchristosdisplay commands shows these combined results.
1594*c42dbd0eSchristos
1595*c42dbd0eSchristos   For example, we can aggregate the data for our single and dual thread
1596*c42dbd0eSchristosexperiments.  Below is the script we used for this:
1597*c42dbd0eSchristos
1598*c42dbd0eSchristos     # Define the metrics
1599*c42dbd0eSchristos     metrics e.%totalcpu
1600*c42dbd0eSchristos     # Limit the output to 10 lines
1601*c42dbd0eSchristos     limit 10
1602*c42dbd0eSchristos     # Get the list with experiments
1603*c42dbd0eSchristos     experiment_list
1604*c42dbd0eSchristos     # Get the function overview
1605*c42dbd0eSchristos     functions
1606*c42dbd0eSchristos
1607*c42dbd0eSchristos   With the exception of the 'experiment_list' command, all commands
1608*c42dbd0eSchristosused have been discussed earlier.
1609*c42dbd0eSchristos
1610*c42dbd0eSchristos   The 'experiment_list' command provides a list of the experiments that
1611*c42dbd0eSchristoshave been loaded.  This is is used to verify we are looking at the
1612*c42dbd0eSchristosexperiments we intend to aggregate.
1613*c42dbd0eSchristos
1614*c42dbd0eSchristos     $ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er
1615*c42dbd0eSchristos
1616*c42dbd0eSchristos   With the command above, we get the following output:
1617*c42dbd0eSchristos
1618*c42dbd0eSchristos     # Define the metrics
1619*c42dbd0eSchristos     Current metrics: e.%totalcpu:name
1620*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1621*c42dbd0eSchristos     # Limit the output to 10 lines
1622*c42dbd0eSchristos     Print limit set to 10
1623*c42dbd0eSchristos     # Get the list with experiments
1624*c42dbd0eSchristos     ID Sel   PID Experiment
1625*c42dbd0eSchristos     == === ===== ============
1626*c42dbd0eSchristos      1 yes 30591 mxv.1.thr.er
1627*c42dbd0eSchristos      2 yes 11629 mxv.2.thr.er
1628*c42dbd0eSchristos     # Get the function overview
1629*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1630*c42dbd0eSchristos
1631*c42dbd0eSchristos     Excl. Total    Name
1632*c42dbd0eSchristos     CPU
1633*c42dbd0eSchristos      sec.      %
1634*c42dbd0eSchristos     4.533 100.00   <Total>
1635*c42dbd0eSchristos     4.306  94.99   mxv_core
1636*c42dbd0eSchristos     0.105   2.31   init_data
1637*c42dbd0eSchristos     0.053   1.17   erand48_r
1638*c42dbd0eSchristos     0.027   0.59   __drand48_iterate
1639*c42dbd0eSchristos     0.021   0.46   _int_malloc
1640*c42dbd0eSchristos     0.021   0.46   drand48
1641*c42dbd0eSchristos     0.001   0.02   sysmalloc
1642*c42dbd0eSchristos     0.      0.     __libc_start_main
1643*c42dbd0eSchristos     0.      0.     allocate_data
1644*c42dbd0eSchristos
1645*c42dbd0eSchristos   The first five lines should look familiar.  The five lines following,
1646*c42dbd0eSchristosecho the comment line in the script and show the overview of the
1647*c42dbd0eSchristosexperiments.  This confirms two experiments have been loaded and that
1648*c42dbd0eSchristosboth are active.
1649*c42dbd0eSchristos
1650*c42dbd0eSchristos   This is followed by the function overview.  The timings have been
1651*c42dbd0eSchristossummed up and the percentages are adjusted accordingly.  For example,
1652*c42dbd0eSchristosthe total accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.
1653*c42dbd0eSchristos
1654*c42dbd0eSchristos
1655*c42dbd0eSchristosFile: gprofng.info,  Node: Comparison of Experiments,  Prev: Aggregation of Experiments,  Up: Viewing Multiple Experiments
1656*c42dbd0eSchristos
1657*c42dbd0eSchristos3.3.2 Comparison of Experiments
1658*c42dbd0eSchristos-------------------------------
1659*c42dbd0eSchristos
1660*c42dbd0eSchristosThe support for multiple experiments really shines in comparison mode.
1661*c42dbd0eSchristosThis feature is enabled through the command 'compare on' and is disabled
1662*c42dbd0eSchristosby setting 'compare off'.
1663*c42dbd0eSchristos
1664*c42dbd0eSchristos   In comparison mode, the data for the various experiments is shown
1665*c42dbd0eSchristosside by side, as illustrated below where we compare the results for the
1666*c42dbd0eSchristosmultithreaded experiments using one and two threads respectively:
1667*c42dbd0eSchristos
1668*c42dbd0eSchristos     $ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er
1669*c42dbd0eSchristos
1670*c42dbd0eSchristosThis produces the following output:
1671*c42dbd0eSchristos
1672*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
1673*c42dbd0eSchristos
1674*c42dbd0eSchristos     mxv.1.thr.er  mxv.2.thr.er  mxv.1.thr.er  mxv.2.thr.er
1675*c42dbd0eSchristos     Excl. Total   Excl. Total   Incl. Total   Incl. Total    Name
1676*c42dbd0eSchristos     CPU           CPU           CPU           CPU
1677*c42dbd0eSchristos      sec.          sec.          sec.          sec.
1678*c42dbd0eSchristos     2.272         2.261         2.272         2.261          <Total>
1679*c42dbd0eSchristos     2.159         2.148         2.159         2.148          mxv_core
1680*c42dbd0eSchristos     0.054         0.051         0.102         0.104          init_data
1681*c42dbd0eSchristos     0.025         0.028         0.035         0.045          erand48_r
1682*c42dbd0eSchristos     0.013         0.008         0.048         0.053          drand48
1683*c42dbd0eSchristos     0.011         0.010         0.012         0.010          _int_malloc
1684*c42dbd0eSchristos     0.010         0.017         0.010         0.017          __drand48_iterate
1685*c42dbd0eSchristos     0.001         0.            0.001         0.             sysmalloc
1686*c42dbd0eSchristos     0.            0.            0.114         0.114          __libc_start_main
1687*c42dbd0eSchristos     0.            0.            0.011         0.010          allocate_data
1688*c42dbd0eSchristos     0.            0.            0.001         0.             check_results
1689*c42dbd0eSchristos     0.            0.            2.159         2.148          collector_root
1690*c42dbd0eSchristos     0.            0.            2.159         2.148          driver_mxv
1691*c42dbd0eSchristos     0.            0.            0.114         0.114          main
1692*c42dbd0eSchristos     0.            0.            0.012         0.010          malloc
1693*c42dbd0eSchristos
1694*c42dbd0eSchristos   This table is already helpful to more easily compare (two) profiles,
1695*c42dbd0eSchristosbut there is more that we can do here.
1696*c42dbd0eSchristos
1697*c42dbd0eSchristos   By default, in comparison mode, all measured values are shown.  Often
1698*c42dbd0eSchristosprofiling is about comparing performance data.  It is therefore more
1699*c42dbd0eSchristosuseful to look at differences, or ratios, using one experiment as a
1700*c42dbd0eSchristosreference.
1701*c42dbd0eSchristos
1702*c42dbd0eSchristos   The values shown are relative to this difference.  For example if a
1703*c42dbd0eSchristosratio is below one, it means the reference value was higher.
1704*c42dbd0eSchristos
1705*c42dbd0eSchristos   This feature is supported on the 'compare' command.  In addition to
1706*c42dbd0eSchristos'on', or 'off', this command also supports 'delta', or 'ratio'.
1707*c42dbd0eSchristos
1708*c42dbd0eSchristos   Usage of one of these two keywords enables the comparison feature and
1709*c42dbd0eSchristosshows either the difference, or the ratio, relative to the reference
1710*c42dbd0eSchristosdata.
1711*c42dbd0eSchristos
1712*c42dbd0eSchristos   In the example below, we use the same two experiments used in the
1713*c42dbd0eSchristoscomparison above, but as before, the number of lines is restricted to 10
1714*c42dbd0eSchristosand we focus on the exclusive timings plus percentages.  For the
1715*c42dbd0eSchristoscomparison part we are interested in the differences.
1716*c42dbd0eSchristos
1717*c42dbd0eSchristos   This is the script that produces such an overview:
1718*c42dbd0eSchristos
1719*c42dbd0eSchristos     # Define the metrics
1720*c42dbd0eSchristos     metrics e.%totalcpu
1721*c42dbd0eSchristos     # Limit the output to 10 lines
1722*c42dbd0eSchristos     limit 10
1723*c42dbd0eSchristos     # Set the comparison mode to differences
1724*c42dbd0eSchristos     compare delta
1725*c42dbd0eSchristos     # Get the function overview
1726*c42dbd0eSchristos     functions
1727*c42dbd0eSchristos
1728*c42dbd0eSchristos   Assuming this script file is called 'my-script-comp', this is how we
1729*c42dbd0eSchristosget the table displayed on our screen:
1730*c42dbd0eSchristos
1731*c42dbd0eSchristos     $ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er
1732*c42dbd0eSchristos
1733*c42dbd0eSchristos   Leaving out some of the lines printed, but we have seen before, we
1734*c42dbd0eSchristosget the following table:
1735*c42dbd0eSchristos
1736*c42dbd0eSchristos     mxv.1.thr.er  mxv.2.thr.er
1737*c42dbd0eSchristos     Excl. Total   Excl. Total     Name
1738*c42dbd0eSchristos     CPU           CPU
1739*c42dbd0eSchristos      sec.      %   delta      %
1740*c42dbd0eSchristos     2.272 100.00  -0.011 100.00   <Total>
1741*c42dbd0eSchristos     2.159  95.00  -0.011  94.97   mxv_core
1742*c42dbd0eSchristos     0.054   2.37  -0.003   2.25   init_data
1743*c42dbd0eSchristos     0.025   1.10  +0.003   1.23   erand48_r
1744*c42dbd0eSchristos     0.013   0.57  -0.005   0.35   drand48
1745*c42dbd0eSchristos     0.011   0.48  -0.001   0.44   _int_malloc
1746*c42dbd0eSchristos     0.010   0.44  +0.007   0.75   __drand48_iterate
1747*c42dbd0eSchristos     0.001   0.04  -0.001   0.     sysmalloc
1748*c42dbd0eSchristos     0.      0.    +0.      0.     __libc_start_main
1749*c42dbd0eSchristos     0.      0.    +0.      0.     allocate_data
1750*c42dbd0eSchristos
1751*c42dbd0eSchristos   It is now easy to see that the CPU times for the most time consuming
1752*c42dbd0eSchristosfunctions in this code are practically the same.
1753*c42dbd0eSchristos
1754*c42dbd0eSchristos   While in this case we used the delta as a comparison,
1755*c42dbd0eSchristos
1756*c42dbd0eSchristos   Note that the comparison feature is supported at the function,
1757*c42dbd0eSchristossource, and disassembly level.  There is no practical limit on the
1758*c42dbd0eSchristosnumber of experiments that can be used in a comparison.
1759*c42dbd0eSchristos
1760*c42dbd0eSchristos
1761*c42dbd0eSchristosFile: gprofng.info,  Node: Profile Hardware Event Counters,  Next: Java Profiling,  Prev: Viewing Multiple Experiments,  Up: A Mini Tutorial
1762*c42dbd0eSchristos
1763*c42dbd0eSchristos3.4 Profile Hardware Event Counters
1764*c42dbd0eSchristos===================================
1765*c42dbd0eSchristos
1766*c42dbd0eSchristosMany processors provide a set of hardware event counters and 'gprofng'
1767*c42dbd0eSchristosprovides support for this feature.  *Note Hardware Event Counters
1768*c42dbd0eSchristosExplained:: for those readers that are not familiar with such counters
1769*c42dbd0eSchristosand like to learn more.
1770*c42dbd0eSchristos
1771*c42dbd0eSchristos   In this section we explain how to get the details on the event
1772*c42dbd0eSchristoscounter support for the processor used in the experiment(s), and show
1773*c42dbd0eSchristosseveral examples.
1774*c42dbd0eSchristos
1775*c42dbd0eSchristos* Menu:
1776*c42dbd0eSchristos
1777*c42dbd0eSchristos* Getting Information on the Counters Supported::
1778*c42dbd0eSchristos* Examples Using Hardware Event Counters::
1779*c42dbd0eSchristos
1780*c42dbd0eSchristos
1781*c42dbd0eSchristosFile: gprofng.info,  Node: Getting Information on the Counters Supported,  Next: Examples Using Hardware Event Counters,  Up: Profile Hardware Event Counters
1782*c42dbd0eSchristos
1783*c42dbd0eSchristos3.4.1 Getting Information on the Counters Supported
1784*c42dbd0eSchristos---------------------------------------------------
1785*c42dbd0eSchristos
1786*c42dbd0eSchristosThe first step is to check if the processor used for the experiments is
1787*c42dbd0eSchristossupported by 'gprofng'.
1788*c42dbd0eSchristos
1789*c42dbd0eSchristos   The '-h' option on 'gprofng collect app' will show the event counter
1790*c42dbd0eSchristosinformation:
1791*c42dbd0eSchristos
1792*c42dbd0eSchristos     $ gprofng collect app -h
1793*c42dbd0eSchristos
1794*c42dbd0eSchristos   In case the counters are supported, a list with the events is
1795*c42dbd0eSchristosprinted.  Otherwise, a warning message will be issued.
1796*c42dbd0eSchristos
1797*c42dbd0eSchristos   For example, below we show this command and the output on an Intel
1798*c42dbd0eSchristosXeon Platinum 8167M (aka "Skylake") processor.  The output has been
1799*c42dbd0eSchristossplit into several sections and each section is commented upon
1800*c42dbd0eSchristosseparately.
1801*c42dbd0eSchristos
1802*c42dbd0eSchristos     Run "gprofng collect app --help" for a usage message.
1803*c42dbd0eSchristos
1804*c42dbd0eSchristos     Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
1805*c42dbd0eSchristos     (cpuver=2499):
1806*c42dbd0eSchristos
1807*c42dbd0eSchristos       -h {auto|lo|on|hi}
1808*c42dbd0eSchristos     	turn on default set of HW counters at the specified rate
1809*c42dbd0eSchristos       -h <ctr_def> [-h <ctr_def>]...
1810*c42dbd0eSchristos       -h <ctr_def>[,<ctr_def>]...
1811*c42dbd0eSchristos     	specify HW counter profiling for up to 4 HW counters
1812*c42dbd0eSchristos
1813*c42dbd0eSchristos   The first line shows how to get a usage overview.  This is followed
1814*c42dbd0eSchristosby some information on the target processor.
1815*c42dbd0eSchristos
1816*c42dbd0eSchristos   The next five lines explain in what ways the '-h' option can be used
1817*c42dbd0eSchristosto define the events to be monitored.
1818*c42dbd0eSchristos
1819*c42dbd0eSchristos   The first version shown above enables a default set of counters.
1820*c42dbd0eSchristosThis default depends on the processor this command is executed on.  The
1821*c42dbd0eSchristoskeyword following the '-h' option defines the sampling rate:
1822*c42dbd0eSchristos
1823*c42dbd0eSchristos'auto'
1824*c42dbd0eSchristos     Match the sample rate of used by clock profiling.  If the latter is
1825*c42dbd0eSchristos     disabled, Use a per thread sampling rate of approximately 100
1826*c42dbd0eSchristos     samples per second.  This setting is the default and preferred.
1827*c42dbd0eSchristos
1828*c42dbd0eSchristos'on'
1829*c42dbd0eSchristos     Use a per thread sampling rate of approximately 100 samples per
1830*c42dbd0eSchristos     second.
1831*c42dbd0eSchristos
1832*c42dbd0eSchristos'lo'
1833*c42dbd0eSchristos     Use a per thread sampling rate of approximately 10 samples per
1834*c42dbd0eSchristos     second.
1835*c42dbd0eSchristos
1836*c42dbd0eSchristos'hi'
1837*c42dbd0eSchristos     Use a per thread sampling rate of approximately 1000 samples per
1838*c42dbd0eSchristos     second.
1839*c42dbd0eSchristos
1840*c42dbd0eSchristos   The second and third variant define the events to be monitored.  Note
1841*c42dbd0eSchristosthat the number of simultaneous events supported is printed.  In this
1842*c42dbd0eSchristoscase we can monitor four events in a single profiling job.
1843*c42dbd0eSchristos
1844*c42dbd0eSchristos   It is a matter of preference whether you like to use the '-h' option
1845*c42dbd0eSchristosfor each event, or use it once, followed by a comma separated list.
1846*c42dbd0eSchristos
1847*c42dbd0eSchristos   There is one slight catch though.  The counter definition below has
1848*c42dbd0eSchristosmandatory comma (',') between the event and the rate.  While a default
1849*c42dbd0eSchristoscan be used for the rate, the comma cannot be omitted.  This may result
1850*c42dbd0eSchristosin a somewhat awkward counter definition in case the default sampling
1851*c42dbd0eSchristosrate is used.
1852*c42dbd0eSchristos
1853*c42dbd0eSchristos   For example, the following two commands are equivalent.  Note the
1854*c42dbd0eSchristosdouble comma in the second command.  This is not a typo.
1855*c42dbd0eSchristos
1856*c42dbd0eSchristos     $ gprofng collect app -h cycles -h insts ...
1857*c42dbd0eSchristos     $ gprofng collect app -h cycles,,insts ...
1858*c42dbd0eSchristos
1859*c42dbd0eSchristos   In the first command this comma is not needed, because a comma
1860*c42dbd0eSchristos("','") immediately followed by white space may be omitted.
1861*c42dbd0eSchristos
1862*c42dbd0eSchristos   This is why we prefer the this syntax and in the remainder will use
1863*c42dbd0eSchristosthe first version of this command.
1864*c42dbd0eSchristos
1865*c42dbd0eSchristos   The counter definition takes an event name, plus optionally one or
1866*c42dbd0eSchristosmore attributes, followed by a comma, and optionally the sampling rate.
1867*c42dbd0eSchristosThe output section below shows the formal definition.
1868*c42dbd0eSchristos
1869*c42dbd0eSchristos       <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]
1870*c42dbd0eSchristos
1871*c42dbd0eSchristos   The printed help then explains this syntax.  Below we have summarized
1872*c42dbd0eSchristosand expanded this output:
1873*c42dbd0eSchristos
1874*c42dbd0eSchristos'<ctr>'
1875*c42dbd0eSchristos     The counter name must be selected from the available counters
1876*c42dbd0eSchristos     listed as part of the output printed with the '-h' option.  On most
1877*c42dbd0eSchristos     systems, if a counter is not listed, it may still be specified by
1878*c42dbd0eSchristos     its numeric value.
1879*c42dbd0eSchristos
1880*c42dbd0eSchristos'~<attr>=<val>'
1881*c42dbd0eSchristos     This is an optional attribute that depends on the processor.  The
1882*c42dbd0eSchristos     list of supported attributes is printed in the output.  Examples of
1883*c42dbd0eSchristos     attributes are "user", or "system".  The value can given in decimal
1884*c42dbd0eSchristos     or hexadecimal format.  Multiple attributes may be specified, and
1885*c42dbd0eSchristos     each must be preceded by a ~.
1886*c42dbd0eSchristos
1887*c42dbd0eSchristos'<rate>'
1888*c42dbd0eSchristos
1889*c42dbd0eSchristos     The sampling rate is one of the following:
1890*c42dbd0eSchristos
1891*c42dbd0eSchristos     'auto'
1892*c42dbd0eSchristos          This is the default and matches the rate used by clock
1893*c42dbd0eSchristos          profiling.  If clock profiling is disabled, use 'on'.
1894*c42dbd0eSchristos
1895*c42dbd0eSchristos     'on'
1896*c42dbd0eSchristos          Set the per thread maximum sampling rate to ~100
1897*c42dbd0eSchristos          samples/second
1898*c42dbd0eSchristos
1899*c42dbd0eSchristos     'lo'
1900*c42dbd0eSchristos          Set the per thread maximum sampling rate to ~10 samples/second
1901*c42dbd0eSchristos
1902*c42dbd0eSchristos     'hi'
1903*c42dbd0eSchristos          Set the per thread maximum sampling rate to ~1000
1904*c42dbd0eSchristos          samples/second
1905*c42dbd0eSchristos
1906*c42dbd0eSchristos     '<interval>'
1907*c42dbd0eSchristos          Define the sampling interval.  *Note Control the Sampling
1908*c42dbd0eSchristos          Frequency:: how to define this.
1909*c42dbd0eSchristos
1910*c42dbd0eSchristos   After the section with the formal definition of events and counters,
1911*c42dbd0eSchristosa processor specific list is displayed.  This part starts with an
1912*c42dbd0eSchristosoverview of the default set of counters and the aliased names supported
1913*c42dbd0eSchristos_on this specific processor_.
1914*c42dbd0eSchristos
1915*c42dbd0eSchristos     Default set of HW counters:
1916*c42dbd0eSchristos
1917*c42dbd0eSchristos         -h cycles,,insts,,llm
1918*c42dbd0eSchristos
1919*c42dbd0eSchristos     Aliases for most useful HW counters:
1920*c42dbd0eSchristos
1921*c42dbd0eSchristos      alias    raw name                   type units regs description
1922*c42dbd0eSchristos
1923*c42dbd0eSchristos      cycles   unhalted-core-cycles   CPU-cycles 0123 CPU Cycles
1924*c42dbd0eSchristos      insts    instruction-retired        events 0123 Instructions Executed
1925*c42dbd0eSchristos      llm      llc-misses                 events 0123 Last-Level Cache Misses
1926*c42dbd0eSchristos      br_msp   branch-misses-retired      events 0123 Branch Mispredict
1927*c42dbd0eSchristos      br_ins   branch-instruction-retired events 0123 Branch Instructions
1928*c42dbd0eSchristos
1929*c42dbd0eSchristos   The definitions given above may or may not be available on other
1930*c42dbd0eSchristosprocessors, but we try to maximize the overlap across alias sets.
1931*c42dbd0eSchristos
1932*c42dbd0eSchristos   The table above shows the default set of counters defined for this
1933*c42dbd0eSchristosprocessor, and the aliases.  For each alias the full "raw" name is
1934*c42dbd0eSchristosgiven, plus the unit of the number returned by the counter (CPU cycles,
1935*c42dbd0eSchristosor a raw count), the hardware counter the event is allowed to be mapped
1936*c42dbd0eSchristosonto, and a short description.
1937*c42dbd0eSchristos
1938*c42dbd0eSchristos   The last part of the output contains all the events that can be
1939*c42dbd0eSchristosmonitored:
1940*c42dbd0eSchristos
1941*c42dbd0eSchristos     Raw HW counters:
1942*c42dbd0eSchristos
1943*c42dbd0eSchristos         name                                type      units regs description
1944*c42dbd0eSchristos
1945*c42dbd0eSchristos         unhalted-core-cycles                     CPU-cycles 0123
1946*c42dbd0eSchristos         unhalted-reference-cycles                    events 0123
1947*c42dbd0eSchristos         instruction-retired                          events 0123
1948*c42dbd0eSchristos         llc-reference                                events 0123
1949*c42dbd0eSchristos         llc-misses                                   events 0123
1950*c42dbd0eSchristos         branch-instruction-retired                   events 0123
1951*c42dbd0eSchristos         branch-misses-retired                        events 0123
1952*c42dbd0eSchristos         ld_blocks.store_forward                      events 0123
1953*c42dbd0eSchristos         ld_blocks.no_sr                              events 0123
1954*c42dbd0eSchristos         ld_blocks_partial.address_alias              events 0123
1955*c42dbd0eSchristos         dtlb_load_misses.miss_causes_a_walk          events 0123
1956*c42dbd0eSchristos         dtlb_load_misses.walk_completed_4k           events 0123
1957*c42dbd0eSchristos
1958*c42dbd0eSchristos         <many lines deleted>
1959*c42dbd0eSchristos
1960*c42dbd0eSchristos         l2_lines_out.silent                          events 0123
1961*c42dbd0eSchristos         l2_lines_out.non_silent                      events 0123
1962*c42dbd0eSchristos         l2_lines_out.useless_hwpf                    events 0123
1963*c42dbd0eSchristos         sq_misc.split_lock                           events 0123
1964*c42dbd0eSchristos
1965*c42dbd0eSchristos     See Chapter 19 of the "Intel 64 and IA-32 Architectures Software
1966*c42dbd0eSchristos     Developer's Manual Volume 3B: System Programming Guide"
1967*c42dbd0eSchristos
1968*c42dbd0eSchristos   As can be seen, these names are not always easy to correlate to a
1969*c42dbd0eSchristosspecific event of interest.  The processor manual should provide more
1970*c42dbd0eSchristosclarity on this.
1971*c42dbd0eSchristos
1972*c42dbd0eSchristos
1973*c42dbd0eSchristosFile: gprofng.info,  Node: Examples Using Hardware Event Counters,  Prev: Getting Information on the Counters Supported,  Up: Profile Hardware Event Counters
1974*c42dbd0eSchristos
1975*c42dbd0eSchristos3.4.2 Examples Using Hardware Event Counters
1976*c42dbd0eSchristos--------------------------------------------
1977*c42dbd0eSchristos
1978*c42dbd0eSchristosThe previous section may give the impression that these counters are
1979*c42dbd0eSchristoshard to use, but as we will show now, in practice it is quite simple.
1980*c42dbd0eSchristos
1981*c42dbd0eSchristos   With the information from the '-h' option, we can easily set up our
1982*c42dbd0eSchristosfirst event counter experiment.
1983*c42dbd0eSchristos
1984*c42dbd0eSchristos   We start by using the default set of counters defined for our
1985*c42dbd0eSchristosprocessor and we use 2 threads:
1986*c42dbd0eSchristos
1987*c42dbd0eSchristos     $ exe=mxv-pthreads.exe
1988*c42dbd0eSchristos     $ m=3000
1989*c42dbd0eSchristos     $ n=2000
1990*c42dbd0eSchristos     $ exp=mxv.hwc.def.2.thr.er
1991*c42dbd0eSchristos     $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2
1992*c42dbd0eSchristos
1993*c42dbd0eSchristos   The new option here is '-h auto'.  The 'auto' keyword enables
1994*c42dbd0eSchristoshardware event counter profiling and selects the default set of counters
1995*c42dbd0eSchristosdefined for this processor.
1996*c42dbd0eSchristos
1997*c42dbd0eSchristos   As before, we can display the information, but there is one practical
1998*c42dbd0eSchristoshurdle to take.  Unless we like to view all metrics recorded, we would
1999*c42dbd0eSchristosneed to know the names of the events that have been enabled.  This is
2000*c42dbd0eSchristostedious and also not portable in case we would like to repeat this
2001*c42dbd0eSchristosexperiment on another processor.
2002*c42dbd0eSchristos
2003*c42dbd0eSchristos   This is where the special 'hwc' metric comes very handy.  It
2004*c42dbd0eSchristosautomatically expands to the active set of events used.
2005*c42dbd0eSchristos
2006*c42dbd0eSchristos   With this, it is very easy to display the event counter values.  Note
2007*c42dbd0eSchristosthat although the regular clock based profiling was enabled, we only
2008*c42dbd0eSchristoswant to see the counter values.  We also request to see the percentages
2009*c42dbd0eSchristosand limit the output to the first 5 lines:
2010*c42dbd0eSchristos
2011*c42dbd0eSchristos     $ exp=mxv.hwc.def.2.thr.er
2012*c42dbd0eSchristos     $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp
2013*c42dbd0eSchristos
2014*c42dbd0eSchristos     Current metrics: e.%cycles:e+%insts:e+%llm:name
2015*c42dbd0eSchristos     Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
2016*c42dbd0eSchristos     Print limit set to 5
2017*c42dbd0eSchristos     Functions sorted by metric: Exclusive CPU Cycles
2018*c42dbd0eSchristos
2019*c42dbd0eSchristos     Excl. CPU     Excl. Instructions  Excl. Last-Level   Name
2020*c42dbd0eSchristos     Cycles        Executed            Cache Misses
2021*c42dbd0eSchristos      sec.      %                  %                 %
2022*c42dbd0eSchristos     2.691 100.00  7906475309 100.00   122658983 100.00   <Total>
2023*c42dbd0eSchristos     2.598  96.54  7432724378  94.01   121745696  99.26   mxv_core
2024*c42dbd0eSchristos     0.035   1.31   188860269   2.39       70084   0.06   erand48_r
2025*c42dbd0eSchristos     0.026   0.95    73623396   0.93      763116   0.62   init_data
2026*c42dbd0eSchristos     0.018   0.66    76824434   0.97       40040   0.03   drand48
2027*c42dbd0eSchristos
2028*c42dbd0eSchristos   As we have seen before, the first few lines echo the settings.  This
2029*c42dbd0eSchristosincludes a list with the hardware event counters used by default.
2030*c42dbd0eSchristos
2031*c42dbd0eSchristos   The table that follows makes it very easy to get an overview where
2032*c42dbd0eSchristosthe time is spent and how many of the target events have occurred.
2033*c42dbd0eSchristos
2034*c42dbd0eSchristos   As before, we can drill down deeper and see the same metrics at the
2035*c42dbd0eSchristossource line and instruction level.  Other than using 'hwc' in the
2036*c42dbd0eSchristosmetrics definitions, nothing has changed compared to the previous
2037*c42dbd0eSchristosexamples:
2038*c42dbd0eSchristos
2039*c42dbd0eSchristos     $ exp=mxv.hwc.def.2.thr.er
2040*c42dbd0eSchristos     $ gprofng display text -metrics e.hwc -source mxv_core $exp
2041*c42dbd0eSchristos
2042*c42dbd0eSchristos   This is the relevant part of the output.  Since the lines get very
2043*c42dbd0eSchristoslong, we have somewhat modified the lay-out:
2044*c42dbd0eSchristos
2045*c42dbd0eSchristos        Excl. CPU Excl.        Excl.
2046*c42dbd0eSchristos        Cycles    Instructions Last-Level
2047*c42dbd0eSchristos         sec.     Executed     Cache Misses
2048*c42dbd0eSchristos                                              <Function: mxv_core>
2049*c42dbd0eSchristos        0.                 0          0   32. void __attribute__ ((noinline))
2050*c42dbd0eSchristos                                              mxv_core(...)
2051*c42dbd0eSchristos        0.                 0          0   33. {
2052*c42dbd0eSchristos        0.                 0          0   34.   for (uint64_t i=...) {
2053*c42dbd0eSchristos        0.                 0          0   35.     double row_sum = 0.0;
2054*c42dbd0eSchristos     ## 1.872     7291879319   88150571   36.     for (int64_t j=0; j<n; j++)
2055*c42dbd0eSchristos        0.725      140845059   33595125   37.        row_sum += A[i][j]*b[j];
2056*c42dbd0eSchristos        0.                 0          0   38.     c[i] = row_sum;
2057*c42dbd0eSchristos                                          39.    }
2058*c42dbd0eSchristos        0.                 0          0   40. }
2059*c42dbd0eSchristos
2060*c42dbd0eSchristos   In a smiliar way we can display the event counter values at the
2061*c42dbd0eSchristosinstruction level.  Again we have modified the lay-out due to page width
2062*c42dbd0eSchristoslimitations:
2063*c42dbd0eSchristos
2064*c42dbd0eSchristos     $ exp=mxv.hwc.def.2.thr.er
2065*c42dbd0eSchristos     $ gprofng display text -metrics e.hwc -disasm mxv_core $exp
2066*c42dbd0eSchristos
2067*c42dbd0eSchristos        Excl. CPU Excl.        Excl.
2068*c42dbd0eSchristos        Cycles    Instructions Last-Level
2069*c42dbd0eSchristos         sec.     Executed     Cache Misses
2070*c42dbd0eSchristos                                                     <Function: mxv_core>
2071*c42dbd0eSchristos        0.                 0          0  [33] 4021ba: mov   0x8(%rsp),%r10
2072*c42dbd0eSchristos                                         34.   for (uint64_t i=...) {
2073*c42dbd0eSchristos        0.                 0          0  [34] 4021bf: cmp   %rsi,%rdi
2074*c42dbd0eSchristos        0.                 0          0  [34] 4021c2: jbe   0x37
2075*c42dbd0eSchristos        0.                 0          0  [34] 4021c4: ret
2076*c42dbd0eSchristos                                         35.       double row_sum = 0.0;
2077*c42dbd0eSchristos                                         36.       for (int64_t j=0; j<n; j++)
2078*c42dbd0eSchristos                                         37.         row_sum += A[i][j]*b[j];
2079*c42dbd0eSchristos        0.                 0          0  [37] 4021c5: mov   (%r8,%rdi,8),%rdx
2080*c42dbd0eSchristos        0.                 0          0  [36] 4021c9: mov   $0x0,%eax
2081*c42dbd0eSchristos        0.                 0          0  [35] 4021ce: pxor  %xmm1,%xmm1
2082*c42dbd0eSchristos        0.002       12804230     321394  [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
2083*c42dbd0eSchristos        0.141       60819025    3866677  [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
2084*c42dbd0eSchristos        0.582       67221804   29407054  [37] 4021dd: addsd %xmm0,%xmm1
2085*c42dbd0eSchristos     ## 1.871     7279075109   87989870  [36] 4021e1: add   $0x1,%rax
2086*c42dbd0eSchristos        0.002       12804210      80351  [36] 4021e5: cmp   %rax,%rcx
2087*c42dbd0eSchristos        0.                 0          0  [36] 4021e8: jne   0xffffffffffffffea
2088*c42dbd0eSchristos                                         38.       c[i] = row_sum;
2089*c42dbd0eSchristos        0.                 0          0  [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
2090*c42dbd0eSchristos        0.                 0          0  [34] 4021f0: add   $0x1,%rdi
2091*c42dbd0eSchristos        0.                 0          0  [34] 4021f4: cmp   %rdi,%rsi
2092*c42dbd0eSchristos        0.                 0          0  [34] 4021f7: jb    0xd
2093*c42dbd0eSchristos        0.                 0          0  [35] 4021f9: pxor  %xmm1,%xmm1
2094*c42dbd0eSchristos        0.                 0          0  [36] 4021fd: test  %rcx,%rcx
2095*c42dbd0eSchristos        0.                 0      80350  [36] 402200: jne   0xffffffffffffffc5
2096*c42dbd0eSchristos        0.                 0          0  [36] 402202: jmp   0xffffffffffffffe8
2097*c42dbd0eSchristos                                         39.   }
2098*c42dbd0eSchristos                                         40. }
2099*c42dbd0eSchristos        0.                 0          0  [40]  402204:  ret
2100*c42dbd0eSchristos
2101*c42dbd0eSchristos   So far we have used the default settings for the event counters.  It
2102*c42dbd0eSchristosis quite straightforward to select specific counters.  For sake of the
2103*c42dbd0eSchristosexample, let's assume we would like to count how many branch
2104*c42dbd0eSchristosinstructions and retired memory load instructions that missed in the L1
2105*c42dbd0eSchristoscache have been executed.  We also want to count these events with a
2106*c42dbd0eSchristoshigh resolution.
2107*c42dbd0eSchristos
2108*c42dbd0eSchristos   This is the command to do so:
2109*c42dbd0eSchristos
2110*c42dbd0eSchristos     $ exe=mxv-pthreads.exe
2111*c42dbd0eSchristos     $ m=3000
2112*c42dbd0eSchristos     $ n=2000
2113*c42dbd0eSchristos     $ exp=mxv.hwc.sel.2.thr.er
2114*c42dbd0eSchristos     $ hwc1=br_ins,hi
2115*c42dbd0eSchristos     $ hwc2=mem_load_retired.l1_miss,hi
2116*c42dbd0eSchristos     $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
2117*c42dbd0eSchristos
2118*c42dbd0eSchristos   As before, we get a table with the event counts.  Due to the very
2119*c42dbd0eSchristoslong name for the second counter, we have somewhat modified the output.
2120*c42dbd0eSchristos
2121*c42dbd0eSchristos     $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er
2122*c42dbd0eSchristos
2123*c42dbd0eSchristos     Functions sorted by metric: Exclusive Total CPU Time
2124*c42dbd0eSchristos     Excl.     Incl.     Excl. Branch  Excl.                 Name
2125*c42dbd0eSchristos     Total     Total     Instructions  mem_load_retired.l1_miss
2126*c42dbd0eSchristos     CPU sec.  CPU sec.                Events
2127*c42dbd0eSchristos     2.597     2.597     1305305319    4021340               <Total>
2128*c42dbd0eSchristos     2.481     2.481     1233233242    3982327               mxv_core
2129*c42dbd0eSchristos     0.040     0.107       19019012       9003               init_data
2130*c42dbd0eSchristos     0.028     0.052       23023048      15006               erand48_r
2131*c42dbd0eSchristos     0.024     0.024       19019008       9004               __drand48_iterate
2132*c42dbd0eSchristos     0.015     0.067       11011009       2998               drand48
2133*c42dbd0eSchristos     0.008     0.010              0       3002               _int_malloc
2134*c42dbd0eSchristos     0.001     0.001              0          0               brk
2135*c42dbd0eSchristos     0.001     0.002              0          0               sysmalloc
2136*c42dbd0eSchristos     0.        0.001              0          0               __default_morecore
2137*c42dbd0eSchristos
2138*c42dbd0eSchristos   When using event counters, the values could be very large and it is
2139*c42dbd0eSchristosnot easy to compare the numbers.  As we will show next, the 'ratio'
2140*c42dbd0eSchristosfeature is very useful when comparing such profiles.
2141*c42dbd0eSchristos
2142*c42dbd0eSchristos   To demonstrate this, we have set up another event counter experiment
2143*c42dbd0eSchristoswhere we would like to compare the number of last level cache miss and
2144*c42dbd0eSchristosthe number of branch instructions executed when using a single thread,
2145*c42dbd0eSchristosor two threads.
2146*c42dbd0eSchristos
2147*c42dbd0eSchristos   These are the commands used to generate the experiment directories:
2148*c42dbd0eSchristos
2149*c42dbd0eSchristos     $ exe=./mxv-pthreads.exe
2150*c42dbd0eSchristos     $ m=3000
2151*c42dbd0eSchristos     $ n=2000
2152*c42dbd0eSchristos     $ exp1=mxv.hwc.comp.1.thr.er
2153*c42dbd0eSchristos     $ exp2=mxv.hwc.comp.2.thr.er
2154*c42dbd0eSchristos     $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
2155*c42dbd0eSchristos     $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2
2156*c42dbd0eSchristos
2157*c42dbd0eSchristos   The following script has been used to get the tables.  Due to lay-out
2158*c42dbd0eSchristosrestrictions, we have to create two tables, one for each counter.
2159*c42dbd0eSchristos
2160*c42dbd0eSchristos     # Limit the output to 5 lines
2161*c42dbd0eSchristos     limit 5
2162*c42dbd0eSchristos     # Define the metrics
2163*c42dbd0eSchristos     metrics name:e.llm
2164*c42dbd0eSchristos     # Set the comparison to ratio
2165*c42dbd0eSchristos     compare ratio
2166*c42dbd0eSchristos     functions
2167*c42dbd0eSchristos     # Define the metrics
2168*c42dbd0eSchristos     metrics name:e.br_ins
2169*c42dbd0eSchristos     # Set the comparison to ratio
2170*c42dbd0eSchristos     compare ratio
2171*c42dbd0eSchristos     functions
2172*c42dbd0eSchristos
2173*c42dbd0eSchristos   Note that we print the name of the function first, followed by the
2174*c42dbd0eSchristoscounter data.  The new element is that we set the comparison mode to
2175*c42dbd0eSchristos'ratio'.  This divides the data in a column by its counterpart in the
2176*c42dbd0eSchristosreference experiment.
2177*c42dbd0eSchristos
2178*c42dbd0eSchristos   This is the command using this script and the two experiment
2179*c42dbd0eSchristosdirectories as input:
2180*c42dbd0eSchristos
2181*c42dbd0eSchristos     $ gprofng display text -script my-script-comp-counters \
2182*c42dbd0eSchristos       mxv.hwc.comp.1.thr.er \
2183*c42dbd0eSchristos       mxv.hwc.comp.2.thr.er
2184*c42dbd0eSchristos
2185*c42dbd0eSchristos   By design, we get two tables, one for each counter:
2186*c42dbd0eSchristos
2187*c42dbd0eSchristos     Functions sorted by metric: Exclusive Last-Level Cache Misses
2188*c42dbd0eSchristos
2189*c42dbd0eSchristos                                   mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
2190*c42dbd0eSchristos     Name                          Excl. Last-Level       Excl. Last-Level
2191*c42dbd0eSchristos                                   Cache Misses           Cache Misses
2192*c42dbd0eSchristos                                                              ratio
2193*c42dbd0eSchristos      <Total>                      122709276              x   0.788
2194*c42dbd0eSchristos      mxv_core                     121796001              x   0.787
2195*c42dbd0eSchristos      init_data                       723064              x   1.055
2196*c42dbd0eSchristos      erand48_r                       100111              x   0.500
2197*c42dbd0eSchristos      drand48                          60065              x   1.167
2198*c42dbd0eSchristos
2199*c42dbd0eSchristos     Functions sorted by metric: Exclusive Branch Instructions
2200*c42dbd0eSchristos
2201*c42dbd0eSchristos                                   mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
2202*c42dbd0eSchristos     Name                          Excl. Branch           Excl. Branch
2203*c42dbd0eSchristos                                   Instructions           Instructions
2204*c42dbd0eSchristos                                                            ratio
2205*c42dbd0eSchristos      <Total>                      1307307316             x 0.997
2206*c42dbd0eSchristos      mxv_core                     1235235239             x 0.997
2207*c42dbd0eSchristos      erand48_r                      23023033             x 0.957
2208*c42dbd0eSchristos      drand48                        20020009             x 0.600
2209*c42dbd0eSchristos      __drand48_iterate              17017028             x 0.882
2210*c42dbd0eSchristos
2211*c42dbd0eSchristos   A ratio less than one in the second column, means that this counter
2212*c42dbd0eSchristosvalue was smaller than the value from the reference experiment shown in
2213*c42dbd0eSchristosthe first column.
2214*c42dbd0eSchristos
2215*c42dbd0eSchristos   This kind of presentation of the results makes it much easier to
2216*c42dbd0eSchristosquickly interpret the data.
2217*c42dbd0eSchristos
2218*c42dbd0eSchristos   We conclude this section with thread-level event counter overviews,
2219*c42dbd0eSchristosbut before we go into this, there is an important metric we need to
2220*c42dbd0eSchristosmention.
2221*c42dbd0eSchristos
2222*c42dbd0eSchristos   In case it is known how many instructions and CPU cycles have been
2223*c42dbd0eSchristosexecuted, the value for the IPC ("Instructions Per Clockycle") can be
2224*c42dbd0eSchristoscomputed.  *Note Hardware Event Counters Explained::.  This is a derived
2225*c42dbd0eSchristosmetric that gives an indication how well the processor is utilized.  The
2226*c42dbd0eSchristosinverse of the IPC is called CPI.
2227*c42dbd0eSchristos
2228*c42dbd0eSchristos   The 'gprofng display text' command automatically computes the IPC and
2229*c42dbd0eSchristosCPI values if an experiment contains the event counter values for the
2230*c42dbd0eSchristosinstructions and CPU cycles executed.  These are part of the metric list
2231*c42dbd0eSchristosand can be displayed, just like any other metric.
2232*c42dbd0eSchristos
2233*c42dbd0eSchristos   This can be verified through the 'metric_list' command.  If we go
2234*c42dbd0eSchristosback to our earlier experiment with the default event counters, we get
2235*c42dbd0eSchristosthe following result.
2236*c42dbd0eSchristos
2237*c42dbd0eSchristos     $ gprofng display text -metric_list mxv.hwc.def.2.thr.er
2238*c42dbd0eSchristos
2239*c42dbd0eSchristos     Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
2240*c42dbd0eSchristos     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
2241*c42dbd0eSchristos     Available metrics:
2242*c42dbd0eSchristos              Exclusive Total CPU Time: e.%totalcpu
2243*c42dbd0eSchristos              Inclusive Total CPU Time: i.%totalcpu
2244*c42dbd0eSchristos                  Exclusive CPU Cycles: e.+%cycles
2245*c42dbd0eSchristos                  Inclusive CPU Cycles: i.+%cycles
2246*c42dbd0eSchristos       Exclusive Instructions Executed: e+%insts
2247*c42dbd0eSchristos       Inclusive Instructions Executed: i+%insts
2248*c42dbd0eSchristos     Exclusive Last-Level Cache Misses: e+%llm
2249*c42dbd0eSchristos     Inclusive Last-Level Cache Misses: i+%llm
2250*c42dbd0eSchristos      Exclusive Instructions Per Cycle: e+IPC
2251*c42dbd0eSchristos      Inclusive Instructions Per Cycle: i+IPC
2252*c42dbd0eSchristos      Exclusive Cycles Per Instruction: e+CPI
2253*c42dbd0eSchristos      Inclusive Cycles Per Instruction: i+CPI
2254*c42dbd0eSchristos                                  Size: size
2255*c42dbd0eSchristos                            PC Address: address
2256*c42dbd0eSchristos                                  Name: name
2257*c42dbd0eSchristos
2258*c42dbd0eSchristos   Among the other metrics, we see the new metrics for the IPC and CPI
2259*c42dbd0eSchristoslisted.
2260*c42dbd0eSchristos
2261*c42dbd0eSchristos   In the script below, we use this information and add the IPC and CPI
2262*c42dbd0eSchristosto the metrics to be displayed.  We also use a the thread filter to
2263*c42dbd0eSchristosdisplay these values for the individual threads.
2264*c42dbd0eSchristos
2265*c42dbd0eSchristos   This is the complete script we have used.  Other than a different
2266*c42dbd0eSchristosselection of the metrics, there are no new features.
2267*c42dbd0eSchristos
2268*c42dbd0eSchristos     # Define the metrics
2269*c42dbd0eSchristos     metrics e.insts:e.%cycles:e.IPC:e.CPI
2270*c42dbd0eSchristos     # Sort with respect to cycles
2271*c42dbd0eSchristos     sort e.cycles
2272*c42dbd0eSchristos     # Limit the output to 5 lines
2273*c42dbd0eSchristos     limit 5
2274*c42dbd0eSchristos     # Get the function overview for all threads
2275*c42dbd0eSchristos     functions
2276*c42dbd0eSchristos     # Get the function overview for thread 1
2277*c42dbd0eSchristos     thread_select 1
2278*c42dbd0eSchristos     functions
2279*c42dbd0eSchristos     # Get the function overview for thread 2
2280*c42dbd0eSchristos     thread_select 2
2281*c42dbd0eSchristos     functions
2282*c42dbd0eSchristos     # Get the function overview for thread 3
2283*c42dbd0eSchristos     thread_select 3
2284*c42dbd0eSchristos     functions
2285*c42dbd0eSchristos
2286*c42dbd0eSchristos   In the metrics definition on the second line, we explicitly request
2287*c42dbd0eSchristosthe counter values for the instructions ('e.insts') and CPU cycles
2288*c42dbd0eSchristos('e.cycles') executed.  These names can be found in output from the
2289*c42dbd0eSchristos'metric_list' commad above.  In addition to these metrics, we also
2290*c42dbd0eSchristosrequest the IPC and CPI to be shown.
2291*c42dbd0eSchristos
2292*c42dbd0eSchristos   As before, we used the 'limit' command to control the number of
2293*c42dbd0eSchristosfunctions displayed.  We then request an overview for all the threads,
2294*c42dbd0eSchristosfollowed by three sets of two commands to select a thread and display
2295*c42dbd0eSchristosthe function overview.
2296*c42dbd0eSchristos
2297*c42dbd0eSchristos   The script above is used as follows:
2298*c42dbd0eSchristos
2299*c42dbd0eSchristos     $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er
2300*c42dbd0eSchristos
2301*c42dbd0eSchristos   This script produces four tables.  We list them separately below, and
2302*c42dbd0eSchristoshave left out the additional output.
2303*c42dbd0eSchristos
2304*c42dbd0eSchristos   The first table shows the accumulated values across the three threads
2305*c42dbd0eSchristosthat have been active.
2306*c42dbd0eSchristos
2307*c42dbd0eSchristos     Functions sorted by metric: Exclusive CPU Cycles
2308*c42dbd0eSchristos
2309*c42dbd0eSchristos     Excl.         Excl. CPU     Excl.  Excl.   Name
2310*c42dbd0eSchristos     Instructions  Cycles        IPC    CPI
2311*c42dbd0eSchristos     Executed       sec.      %
2312*c42dbd0eSchristos     7906475309    2.691 100.00  1.473  0.679   <Total>
2313*c42dbd0eSchristos     7432724378    2.598  96.54  1.434  0.697   mxv_core
2314*c42dbd0eSchristos      188860269    0.035   1.31  2.682  0.373   erand48_r
2315*c42dbd0eSchristos       73623396    0.026   0.95  1.438  0.696   init_data
2316*c42dbd0eSchristos       76824434    0.018   0.66  2.182  0.458   drand48
2317*c42dbd0eSchristos
2318*c42dbd0eSchristos   This shows that IPC of this program is completely dominated by
2319*c42dbd0eSchristosfunction 'mxv_core'.  It has a fairly low IPC value of 1.43.
2320*c42dbd0eSchristos
2321*c42dbd0eSchristos   The next table is for thread 1 and shows the values for the main
2322*c42dbd0eSchristosthread.
2323*c42dbd0eSchristos
2324*c42dbd0eSchristos     Exp Sel Total
2325*c42dbd0eSchristos     === === =====
2326*c42dbd0eSchristos       1 1       3
2327*c42dbd0eSchristos     Functions sorted by metric: Exclusive CPU Cycles
2328*c42dbd0eSchristos
2329*c42dbd0eSchristos     Excl.         Excl. CPU     Excl.  Excl.   Name
2330*c42dbd0eSchristos     Instructions  Cycles        IPC    CPI
2331*c42dbd0eSchristos     Executed       sec.      %
2332*c42dbd0eSchristos     473750931     0.093 100.00  2.552  0.392   <Total>
2333*c42dbd0eSchristos     188860269     0.035  37.93  2.682  0.373   erand48_r
2334*c42dbd0eSchristos      73623396     0.026  27.59  1.438  0.696   init_data
2335*c42dbd0eSchristos      76824434     0.018  18.97  2.182  0.458   drand48
2336*c42dbd0eSchristos     134442832     0.013  13.79  5.250  0.190   __drand48_iterate
2337*c42dbd0eSchristos
2338*c42dbd0eSchristos   Although this thread hardly uses any CPU cycles, the overall IPC of
2339*c42dbd0eSchristos2.55 is not all that bad.
2340*c42dbd0eSchristos
2341*c42dbd0eSchristos   Last, we show the tables for threads 2 and 3:
2342*c42dbd0eSchristos
2343*c42dbd0eSchristos     Exp Sel Total
2344*c42dbd0eSchristos     === === =====
2345*c42dbd0eSchristos       1 2       3
2346*c42dbd0eSchristos     Functions sorted by metric: Exclusive CPU Cycles
2347*c42dbd0eSchristos
2348*c42dbd0eSchristos     Excl.         Excl. CPU     Excl.  Excl.   Name
2349*c42dbd0eSchristos     Instructions  Cycles        IPC    CPI
2350*c42dbd0eSchristos     Executed       sec.      %
2351*c42dbd0eSchristos     3716362189    1.298 100.00  1.435  0.697   <Total>
2352*c42dbd0eSchristos     3716362189    1.298 100.00  1.435  0.697   mxv_core
2353*c42dbd0eSchristos              0    0.      0.    0.     0.      collector_root
2354*c42dbd0eSchristos              0    0.      0.    0.     0.      driver_mxv
2355*c42dbd0eSchristos
2356*c42dbd0eSchristos     Exp Sel Total
2357*c42dbd0eSchristos     === === =====
2358*c42dbd0eSchristos       1 3       3
2359*c42dbd0eSchristos     Functions sorted by metric: Exclusive CPU Cycles
2360*c42dbd0eSchristos
2361*c42dbd0eSchristos     Excl.         Excl. CPU     Excl.  Excl.   Name
2362*c42dbd0eSchristos     Instructions  Cycles        IPC    CPI
2363*c42dbd0eSchristos     Executed       sec.      %
2364*c42dbd0eSchristos     3716362189    1.300 100.00  1.433  0.698   <Total>
2365*c42dbd0eSchristos     3716362189    1.300 100.00  1.433  0.698   mxv_core
2366*c42dbd0eSchristos              0    0.      0.    0.     0.      collector_root
2367*c42dbd0eSchristos              0    0.      0.    0.     0.      driver_mxv
2368*c42dbd0eSchristos
2369*c42dbd0eSchristos   It is seen that both execute the same number of instructions and take
2370*c42dbd0eSchristosabout the same number of CPU cycles.  As a result, the IPC is the same
2371*c42dbd0eSchristosfor both threads.
2372*c42dbd0eSchristos
2373*c42dbd0eSchristos
2374*c42dbd0eSchristosFile: gprofng.info,  Node: Java Profiling,  Prev: Profile Hardware Event Counters,  Up: A Mini Tutorial
2375*c42dbd0eSchristos
2376*c42dbd0eSchristos3.5 Java Profiling
2377*c42dbd0eSchristos==================
2378*c42dbd0eSchristos
2379*c42dbd0eSchristosThe 'gprofng collect app' command supports Java profiling.  The '-j on'
2380*c42dbd0eSchristosoption can be used for this, but since this feature is enabled by
2381*c42dbd0eSchristosdefault, there is no need to set this explicitly.  Java profiling may be
2382*c42dbd0eSchristosdisabled through the '-j off' option.
2383*c42dbd0eSchristos
2384*c42dbd0eSchristos   The program is compiled as usual and the experiment directory is
2385*c42dbd0eSchristoscreated similar to what we have seen before.  The only difference with a
2386*c42dbd0eSchristosC/C++ application is that the program has to be explicitly executed by
2387*c42dbd0eSchristosjava.
2388*c42dbd0eSchristos
2389*c42dbd0eSchristos   For example, this is how to generate the experiment data for a Java
2390*c42dbd0eSchristosprogram that has the source code stored in file 'Pi.java':
2391*c42dbd0eSchristos
2392*c42dbd0eSchristos     $ javac Pi.java
2393*c42dbd0eSchristos     $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in
2394*c42dbd0eSchristos
2395*c42dbd0eSchristos   Regarding which java is selected to generate the data, 'gprofng'
2396*c42dbd0eSchristosfirst looks for the JDK in the path set in either the 'JDK_HOME'
2397*c42dbd0eSchristosenvironment variable, or in the 'JAVA_PATH' environment variable.  If
2398*c42dbd0eSchristosneither of these variables is set, it checks for a JDK in the search
2399*c42dbd0eSchristospath (set in the PATH environment variable).  If there is no JDK in this
2400*c42dbd0eSchristospath, it checks for the java executable in '/usr/java/bin/java'.
2401*c42dbd0eSchristos
2402*c42dbd0eSchristos   In case additional options need to be passed on to the JVM, the '-J
2403*c42dbd0eSchristos<string>' option can be used.  The string with the option(s) has to be
2404*c42dbd0eSchristosdelimited by quotation marks in case there is more than one argument.
2405*c42dbd0eSchristos
2406*c42dbd0eSchristos   The 'gprofng display text' command may be used to view the
2407*c42dbd0eSchristosperformance data.  There is no need for any special options and the same
2408*c42dbd0eSchristoscommands as previously discussed are supported.
2409*c42dbd0eSchristos
2410*c42dbd0eSchristos   The 'viewmode' command *Note The Viewmode:: is very useful to examine
2411*c42dbd0eSchristosthe call stacks.
2412*c42dbd0eSchristos
2413*c42dbd0eSchristos   For example, this is how one can see the native call stacks.  For
2414*c42dbd0eSchristoslay-out purposes we have restricted the list to the first five entries:
2415*c42dbd0eSchristos
2416*c42dbd0eSchristos     $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er
2417*c42dbd0eSchristos
2418*c42dbd0eSchristos     Print limit set to 5
2419*c42dbd0eSchristos     Viewmode set to machine
2420*c42dbd0eSchristos     Functions Call Tree. Metric: Attributed Total CPU Time
2421*c42dbd0eSchristos
2422*c42dbd0eSchristos     Attr.      Name
2423*c42dbd0eSchristos     Total
2424*c42dbd0eSchristos     CPU sec.
2425*c42dbd0eSchristos     1.381      +-<Total>
2426*c42dbd0eSchristos     1.171        +-Pi.calculatePi(double)
2427*c42dbd0eSchristos     0.110        +-collector_root
2428*c42dbd0eSchristos     0.110        |  +-JavaMain
2429*c42dbd0eSchristos     0.070        |    +-jni_CallStaticVoidMethod
2430*c42dbd0eSchristos
2431*c42dbd0eSchristosNote that the selection of the viewmode is echoed in the output.
2432*c42dbd0eSchristos
2433*c42dbd0eSchristos
2434*c42dbd0eSchristosFile: gprofng.info,  Node: Terminology,  Next: Other Document Formats,  Prev: A Mini Tutorial,  Up: Top
2435*c42dbd0eSchristos
2436*c42dbd0eSchristos4 Terminology
2437*c42dbd0eSchristos*************
2438*c42dbd0eSchristos
2439*c42dbd0eSchristosThroughout this manual, certain terminology specific to profiling tools,
2440*c42dbd0eSchristosor 'gprofng', or even to this document only, is used.  In this chapter
2441*c42dbd0eSchristoswe explain this terminology in detail.
2442*c42dbd0eSchristos
2443*c42dbd0eSchristos* Menu:
2444*c42dbd0eSchristos
2445*c42dbd0eSchristos* The Program Counter::                    What is a Program Counter?
2446*c42dbd0eSchristos* Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
2447*c42dbd0eSchristos* Metric Definitions::                     Definitions associated with metrics.
2448*c42dbd0eSchristos* The Viewmode::                           Select the way call stacks are presented.
2449*c42dbd0eSchristos* The Selection List::                     How to define a selection.
2450*c42dbd0eSchristos* Load Objects and Functions::             The components in an application.
2451*c42dbd0eSchristos* The Concept of a CPU in gprofng:: The definition of a CPU.
2452*c42dbd0eSchristos* Hardware Event Counters Explained::      What are event counters?
2453*c42dbd0eSchristos* apath::                                  Our generic definition of a path.
2454*c42dbd0eSchristos
2455*c42dbd0eSchristos
2456*c42dbd0eSchristosFile: gprofng.info,  Node: The Program Counter,  Next: Inclusive and Exclusive Metrics,  Up: Terminology
2457*c42dbd0eSchristos
2458*c42dbd0eSchristos4.1 The Program Counter
2459*c42dbd0eSchristos=======================
2460*c42dbd0eSchristos
2461*c42dbd0eSchristosThe _Program Counter_, or PC for short, keeps track where program
2462*c42dbd0eSchristosexecution is.  The address of the next instruction to be executed is
2463*c42dbd0eSchristosstored in a special purpose register in the processor, or core.
2464*c42dbd0eSchristos
2465*c42dbd0eSchristos   The PC is sometimes also referred to as the _instruction pointer_,
2466*c42dbd0eSchristosbut we will use Program Counter or PC throughout this document.
2467*c42dbd0eSchristos
2468*c42dbd0eSchristos
2469*c42dbd0eSchristosFile: gprofng.info,  Node: Inclusive and Exclusive Metrics,  Next: Metric Definitions,  Prev: The Program Counter,  Up: Terminology
2470*c42dbd0eSchristos
2471*c42dbd0eSchristos4.2 Inclusive and Exclusive Metrics
2472*c42dbd0eSchristos===================================
2473*c42dbd0eSchristos
2474*c42dbd0eSchristosIn the remainder, these two concepts occur quite often and for lack of a
2475*c42dbd0eSchristosbetter place, they are explained here.
2476*c42dbd0eSchristos
2477*c42dbd0eSchristos   The _inclusive_ value for a metric includes all values that are part
2478*c42dbd0eSchristosof the dynamic extent of the target function.  For example if function
2479*c42dbd0eSchristos'A' calls functions 'B' and 'C', the inclusive CPU time for 'A' includes
2480*c42dbd0eSchristosthe CPU time spent in 'B' and 'C'.
2481*c42dbd0eSchristos
2482*c42dbd0eSchristos   In contrast with this, the _exclusive_ value for a metric is computed
2483*c42dbd0eSchristosby excluding the metric values used by other functions called.  In our
2484*c42dbd0eSchristosimaginary example, the exclusive CPU time for function 'A' is the time
2485*c42dbd0eSchristosspent outside calling functions 'B' and 'C'.
2486*c42dbd0eSchristos
2487*c42dbd0eSchristos   In case of a _leaf function_, the inclusive and exclusive values for
2488*c42dbd0eSchristosthe metric are the same since by definition, it is not calling any other
2489*c42dbd0eSchristosfunction(s).
2490*c42dbd0eSchristos
2491*c42dbd0eSchristos   Why do we use these two different values?  The inclusive metric shows
2492*c42dbd0eSchristosthe most expensive path, in terms of this metric, in the application.
2493*c42dbd0eSchristosFor example, if the metric is cache misses, the function with the
2494*c42dbd0eSchristoshighest inclusive metric tells you where most of the cache misses come
2495*c42dbd0eSchristosfrom.
2496*c42dbd0eSchristos
2497*c42dbd0eSchristos   Within this branch of the application, the exclusive metric points to
2498*c42dbd0eSchristosthe functions that contribute and help to identify which part(s) to
2499*c42dbd0eSchristosconsider for further analysis.
2500*c42dbd0eSchristos
2501*c42dbd0eSchristos
2502*c42dbd0eSchristosFile: gprofng.info,  Node: Metric Definitions,  Next: The Viewmode,  Prev: Inclusive and Exclusive Metrics,  Up: Terminology
2503*c42dbd0eSchristos
2504*c42dbd0eSchristos4.3 Metric Definitions
2505*c42dbd0eSchristos======================
2506*c42dbd0eSchristos
2507*c42dbd0eSchristosThe metrics to be shown are highly customizable.  In this section we
2508*c42dbd0eSchristosexplain the definitions associated with metrics.
2509*c42dbd0eSchristos
2510*c42dbd0eSchristos   The 'metrics' command takes a colon (:) separated list with special
2511*c42dbd0eSchristoskeywords.  This keyword consists of the following three fields:
2512*c42dbd0eSchristos'<flavor>''<visibility>''<metric_name>'.
2513*c42dbd0eSchristos
2514*c42dbd0eSchristos   The _<flavor>_ field is either an 'e' for "exclusive", or 'i' for
2515*c42dbd0eSchristos"inclusive".  The '<metric_name>' field is the name of the metric
2516*c42dbd0eSchristosrequest.  The _<visibility>_ field consists of one ore more characters
2517*c42dbd0eSchristosfrom the following table:
2518*c42dbd0eSchristos
2519*c42dbd0eSchristos'.'
2520*c42dbd0eSchristos     Show the metric as time.  This applies to timing metrics and
2521*c42dbd0eSchristos     hardware event counters that measure cycles.  Interpret as '+' for
2522*c42dbd0eSchristos     other metrics.
2523*c42dbd0eSchristos
2524*c42dbd0eSchristos'%'
2525*c42dbd0eSchristos     Show the metric as a percentage of the total value for this metric.
2526*c42dbd0eSchristos
2527*c42dbd0eSchristos'+'
2528*c42dbd0eSchristos     Show the metric as an absolute value.  For hardware event counters
2529*c42dbd0eSchristos     this is the event count.  Interpret as '.' for timing metrics.
2530*c42dbd0eSchristos
2531*c42dbd0eSchristos'|'
2532*c42dbd0eSchristos     Do not show any metric value.  Cannot be used with other visibility
2533*c42dbd0eSchristos     characters.
2534*c42dbd0eSchristos
2535*c42dbd0eSchristos
2536*c42dbd0eSchristosFile: gprofng.info,  Node: The Viewmode,  Next: The Selection List,  Prev: Metric Definitions,  Up: Terminology
2537*c42dbd0eSchristos
2538*c42dbd0eSchristos4.4 The Viewmode
2539*c42dbd0eSchristos================
2540*c42dbd0eSchristos
2541*c42dbd0eSchristosThere are different ways to view a call stack in Java.  In 'gprofng',
2542*c42dbd0eSchristosthis is called the _viewmode_ and the setting is controlled through a
2543*c42dbd0eSchristoscommand with the same name.
2544*c42dbd0eSchristos
2545*c42dbd0eSchristos   The 'viewmode' command takes one of the following keywords:
2546*c42dbd0eSchristos
2547*c42dbd0eSchristos'user'
2548*c42dbd0eSchristos     This is the default and shows the Java call stacks for Java
2549*c42dbd0eSchristos     threads.  No call stacks for any housekeeping threads are shown.
2550*c42dbd0eSchristos     The function list contains a function '<JVM-System>' that
2551*c42dbd0eSchristos     represents the aggregated time from non-Java threads.  When the JVM
2552*c42dbd0eSchristos     software does not report a Java call stack, time is reported
2553*c42dbd0eSchristos     against the function '<no Java callstack recorded>'.
2554*c42dbd0eSchristos
2555*c42dbd0eSchristos'expert'
2556*c42dbd0eSchristos     Show the Java call stacks for Java threads when the Java code from
2557*c42dbd0eSchristos     the user is executed and machine call stacks when JVM code is
2558*c42dbd0eSchristos     executed, or when the JVM software does not report a Java call
2559*c42dbd0eSchristos     stack.  Show the machine call stacks for housekeeping threads.
2560*c42dbd0eSchristos
2561*c42dbd0eSchristos'machine'
2562*c42dbd0eSchristos     Show the actual native call stacks for all threads.
2563*c42dbd0eSchristos
2564*c42dbd0eSchristos
2565*c42dbd0eSchristosFile: gprofng.info,  Node: The Selection List,  Next: Load Objects and Functions,  Prev: The Viewmode,  Up: Terminology
2566*c42dbd0eSchristos
2567*c42dbd0eSchristos4.5 The Selection List
2568*c42dbd0eSchristos======================
2569*c42dbd0eSchristos
2570*c42dbd0eSchristosSeveral commands allow the user to specify a subset of a list.  For
2571*c42dbd0eSchristosexample, to select specific threads from all the threads that have been
2572*c42dbd0eSchristosused when conducting the experiment(s).
2573*c42dbd0eSchristos
2574*c42dbd0eSchristos   Such a selection list (or "list" in the remainder of this section)
2575*c42dbd0eSchristoscan be a single number, a contiguous range of numbers with the start and
2576*c42dbd0eSchristosend numbers separated by a hyphen ('-'), a comma-separated list of
2577*c42dbd0eSchristosnumbers and ranges, or the 'all' keyword.  Lists must not contain
2578*c42dbd0eSchristosspaces.
2579*c42dbd0eSchristos
2580*c42dbd0eSchristos   Each list can optionally be preceded by an experiment list with a
2581*c42dbd0eSchristossimilar format, separated from the list by a colon (:).  If no
2582*c42dbd0eSchristosexperiment list is included, the list applies to all experiments.
2583*c42dbd0eSchristos
2584*c42dbd0eSchristos   Multiple lists can be concatenated by separating the individual lists
2585*c42dbd0eSchristosby a plus sign.
2586*c42dbd0eSchristos
2587*c42dbd0eSchristos   These are some examples of various filters using a list:
2588*c42dbd0eSchristos
2589*c42dbd0eSchristos'thread_select 1'
2590*c42dbd0eSchristos     Select thread 1 from all experiments.
2591*c42dbd0eSchristos
2592*c42dbd0eSchristos'thread_select all:1'
2593*c42dbd0eSchristos     Select thread 1 from all experiments.
2594*c42dbd0eSchristos
2595*c42dbd0eSchristos'thread_select 1:1+2:2'
2596*c42dbd0eSchristos     Select thread 1 from experiment 1 and thread 2 from experiment 2.
2597*c42dbd0eSchristos
2598*c42dbd0eSchristos'cpu_select all:1,3,5'
2599*c42dbd0eSchristos     Selects cores 1, 3, and 5 from all experiments.
2600*c42dbd0eSchristos
2601*c42dbd0eSchristos'cpu_select 1,2:all'
2602*c42dbd0eSchristos     Select all cores from experiments 1 and 2, as listed by the 'by
2603*c42dbd0eSchristos     exp_list' command.
2604*c42dbd0eSchristos
2605*c42dbd0eSchristos
2606*c42dbd0eSchristosFile: gprofng.info,  Node: Load Objects and Functions,  Next: The Concept of a CPU in gprofng,  Prev: The Selection List,  Up: Terminology
2607*c42dbd0eSchristos
2608*c42dbd0eSchristos4.6 Load Objects and Functions
2609*c42dbd0eSchristos==============================
2610*c42dbd0eSchristos
2611*c42dbd0eSchristosAn application consists of various components.  The source code files
2612*c42dbd0eSchristosare compiled into object files.  These are then glued together at link
2613*c42dbd0eSchristostime to form the executable.  During execution, the program may also
2614*c42dbd0eSchristosdynamically load objects.
2615*c42dbd0eSchristos
2616*c42dbd0eSchristos   A _load object_ is defined to be an executable, or shared object.  A
2617*c42dbd0eSchristosshared library is an example of a load object in 'gprofng'.
2618*c42dbd0eSchristos
2619*c42dbd0eSchristos   Each load object, contains a text section with the instructions
2620*c42dbd0eSchristosgenerated by the compiler, a data section for data, and various symbol
2621*c42dbd0eSchristostables.  All load objects must contain an ELF symbol table, which gives
2622*c42dbd0eSchristosthe names and addresses of all the globally known functions in that
2623*c42dbd0eSchristosobject.
2624*c42dbd0eSchristos
2625*c42dbd0eSchristos   Load objects compiled with the -g option contain additional symbolic
2626*c42dbd0eSchristosinformation that can augment the ELF symbol table and provide
2627*c42dbd0eSchristosinformation about functions that are not global, additional information
2628*c42dbd0eSchristosabout object modules from which the functions came, and line number
2629*c42dbd0eSchristosinformation relating addresses to source lines.
2630*c42dbd0eSchristos
2631*c42dbd0eSchristos   The term _function_ is used to describe a set of instructions that
2632*c42dbd0eSchristosrepresent a high-level operation described in the source code.  The term
2633*c42dbd0eSchristosalso covers methods as used in C++ and in the Java programming language.
2634*c42dbd0eSchristos
2635*c42dbd0eSchristos   In the 'gprofng' context, functions are provided in source code
2636*c42dbd0eSchristosformat.  Normally their names appear in the symbol table representing a
2637*c42dbd0eSchristosset of addresses.  If the Program Counter (PC) is within that set, the
2638*c42dbd0eSchristosprogram is executing within that function.
2639*c42dbd0eSchristos
2640*c42dbd0eSchristos   In principle, any address within the text segment of a load object
2641*c42dbd0eSchristoscan be mapped to a function.  Exactly the same mapping is used for the
2642*c42dbd0eSchristosleaf PC and all the other PCs on the call stack.
2643*c42dbd0eSchristos
2644*c42dbd0eSchristos   Most of the functions correspond directly to the source model of the
2645*c42dbd0eSchristosprogram, but there are exceptions.  This topic is however outside of the
2646*c42dbd0eSchristosscope of this guide.
2647*c42dbd0eSchristos
2648*c42dbd0eSchristos
2649*c42dbd0eSchristosFile: gprofng.info,  Node: The Concept of a CPU in gprofng,  Next: Hardware Event Counters Explained,  Prev: Load Objects and Functions,  Up: Terminology
2650*c42dbd0eSchristos
2651*c42dbd0eSchristos4.7 The Concept of a CPU in gprofng
2652*c42dbd0eSchristos===================================
2653*c42dbd0eSchristos
2654*c42dbd0eSchristosIn gprofng, there is the concept of a CPU. Admittedly, this is not the
2655*c42dbd0eSchristosbest word to describe what is meant here and may be replaced in the
2656*c42dbd0eSchristosfuture.
2657*c42dbd0eSchristos
2658*c42dbd0eSchristos   The word CPU is used in many of the displays.  In the context of
2659*c42dbd0eSchristosgprofng, it is meant to denote a part of the processor that is capable
2660*c42dbd0eSchristosof executing instructions and with its own state, like the program
2661*c42dbd0eSchristoscounter.
2662*c42dbd0eSchristos
2663*c42dbd0eSchristos   For example, on a contemporary processor, a CPU could be a core.  In
2664*c42dbd0eSchristoscase hardware threads are supported within a core, it could be one of
2665*c42dbd0eSchristosthose hardware threads.
2666*c42dbd0eSchristos
2667*c42dbd0eSchristos
2668*c42dbd0eSchristosFile: gprofng.info,  Node: Hardware Event Counters Explained,  Next: apath,  Prev: The Concept of a CPU in gprofng,  Up: Terminology
2669*c42dbd0eSchristos
2670*c42dbd0eSchristos4.8 Hardware Event Counters Explained
2671*c42dbd0eSchristos=====================================
2672*c42dbd0eSchristos
2673*c42dbd0eSchristosFor quite a number of years now, many microprocessors have supported
2674*c42dbd0eSchristoshardware event counters.
2675*c42dbd0eSchristos
2676*c42dbd0eSchristos   On the hardware side, this means that in the processor there are one
2677*c42dbd0eSchristosor more registers dedicated to count certain activities, or "events".
2678*c42dbd0eSchristosExamples of such events are the number of instructions executed, or the
2679*c42dbd0eSchristosnumber of cache misses at level 2 in the memory hierarchy.
2680*c42dbd0eSchristos
2681*c42dbd0eSchristos   While there is a limited set of such registers, the user can map
2682*c42dbd0eSchristosevents onto them.  In case more than one register is available, this
2683*c42dbd0eSchristosallows for the simultaenous measurement of various events.
2684*c42dbd0eSchristos
2685*c42dbd0eSchristos   A simple, yet powerful, example is to simultaneously count the number
2686*c42dbd0eSchristosof CPU cycles and the number of instructions excuted.  These two numbers
2687*c42dbd0eSchristoscan then be used to compute the _IPC_ value.  IPC stands for
2688*c42dbd0eSchristos"Instructions Per Clockcycle" and each processor has a maximum.  For
2689*c42dbd0eSchristosexample, if this maximum number is 2, it means the processor is capable
2690*c42dbd0eSchristosof executing two instructions every clock cycle.
2691*c42dbd0eSchristos
2692*c42dbd0eSchristos   Whether this is actually achieved, depends on several factors,
2693*c42dbd0eSchristosincluding the instruction characteristics.  However, in case the IPC
2694*c42dbd0eSchristosvalue is well below this maximum in a time critical part of the
2695*c42dbd0eSchristosapplication and this cannot be easily explained, further investigation
2696*c42dbd0eSchristosis probably warranted.
2697*c42dbd0eSchristos
2698*c42dbd0eSchristos   A related metric is called _CPI_, or "Clockcycles Per Instruction".
2699*c42dbd0eSchristosIt is the inverse of the CPI and can be compared against the theoretical
2700*c42dbd0eSchristosvalue(s) of the target instruction(s).  A significant difference may
2701*c42dbd0eSchristospoint at a bottleneck.
2702*c42dbd0eSchristos
2703*c42dbd0eSchristos   One thing to keep in mind is that the value returned by a counter can
2704*c42dbd0eSchristoseither be the number of times the event occured, or a CPU cycle count.
2705*c42dbd0eSchristosIn case of the latter it is possible to convert this number to time.
2706*c42dbd0eSchristos
2707*c42dbd0eSchristos   This is often easier to interpret than a simple count, but there is
2708*c42dbd0eSchristosone caveat to keep in mind.  The CPU frequency may not have been
2709*c42dbd0eSchristosconstant while the experimen was recorded and this impacts the time
2710*c42dbd0eSchristosreported.
2711*c42dbd0eSchristos
2712*c42dbd0eSchristos   These event counters, or "counters" for short, provide great insight
2713*c42dbd0eSchristosinto what happens deep inside the processor.  In case higher level
2714*c42dbd0eSchristosinformation does not provide the insight needed, the counters provide
2715*c42dbd0eSchristosthe information to get to the bottom of a performance problem.
2716*c42dbd0eSchristos
2717*c42dbd0eSchristos   There are some things to consider though.
2718*c42dbd0eSchristos
2719*c42dbd0eSchristos   * The event definitions and names vary across processors and it may
2720*c42dbd0eSchristos     even happen that some events change with an update.  Unfortunately
2721*c42dbd0eSchristos     and this is luckily rare, there are sometimes bugs causing the
2722*c42dbd0eSchristos     wrong count to be returned.
2723*c42dbd0eSchristos
2724*c42dbd0eSchristos     In 'gprofng', some of the processor specific event names have an
2725*c42dbd0eSchristos     alias name.  For example 'insts' measures the instructions
2726*c42dbd0eSchristos     executed.  These aliases not only makes it easier to identify the
2727*c42dbd0eSchristos     functionality, but also provide portability of certain events
2728*c42dbd0eSchristos     across processors.
2729*c42dbd0eSchristos
2730*c42dbd0eSchristos   * Another complexity is that there are typically many events one can
2731*c42dbd0eSchristos     monitor.  There may up to hundreds of events available and it could
2732*c42dbd0eSchristos     require several experiments to zoom in on the root cause of a
2733*c42dbd0eSchristos     performance problem.
2734*c42dbd0eSchristos
2735*c42dbd0eSchristos   * There may be restrictions regarding the mapping of event(s) onto
2736*c42dbd0eSchristos     the counters.  For example, certain events may be restricted to
2737*c42dbd0eSchristos     specific counters only.  As a result, one may have to conduct
2738*c42dbd0eSchristos     additional experiments to cover all the events of interest.
2739*c42dbd0eSchristos
2740*c42dbd0eSchristos   * The names of the events may also not be easy to interpret.  In such
2741*c42dbd0eSchristos     cases, the description can be found in the architecture manual for
2742*c42dbd0eSchristos     the processor.
2743*c42dbd0eSchristos
2744*c42dbd0eSchristos   Despite these drawbacks, hardware event counters are extremely useful
2745*c42dbd0eSchristosand may even turn out to be indispensable.
2746*c42dbd0eSchristos
2747*c42dbd0eSchristos
2748*c42dbd0eSchristosFile: gprofng.info,  Node: apath,  Prev: Hardware Event Counters Explained,  Up: Terminology
2749*c42dbd0eSchristos
2750*c42dbd0eSchristos4.9 What is <apath>?
2751*c42dbd0eSchristos====================
2752*c42dbd0eSchristos
2753*c42dbd0eSchristosIn most cases, 'gprofng' shows the absolute pathnames of directories.
2754*c42dbd0eSchristosThese tend to be rather long, causing display issues in this document.
2755*c42dbd0eSchristos
2756*c42dbd0eSchristos   Instead of wrapping these long pathnames over multiple lines, we
2757*c42dbd0eSchristosdecided to represent them by the '<apath>' symbol, which stands for "an
2758*c42dbd0eSchristosabsolute pathname".
2759*c42dbd0eSchristos
2760*c42dbd0eSchristos   Note that different occurrences of '<apath>' may represent different
2761*c42dbd0eSchristosabsolute pathnames.
2762*c42dbd0eSchristos
2763*c42dbd0eSchristos
2764*c42dbd0eSchristosFile: gprofng.info,  Node: Other Document Formats,  Next: Index,  Prev: Terminology,  Up: Top
2765*c42dbd0eSchristos
2766*c42dbd0eSchristos5 Other Document Formats
2767*c42dbd0eSchristos************************
2768*c42dbd0eSchristos
2769*c42dbd0eSchristosThis document is written in Texinfo and the source text is made
2770*c42dbd0eSchristosavailable as part of the binutils distribution.  The file name is
2771*c42dbd0eSchristos'gprofng.texi' and can be found in subdirectory 'doc' under directory
2772*c42dbd0eSchristos'gprofng' in the top level directory.
2773*c42dbd0eSchristos
2774*c42dbd0eSchristos   This file can be used to generate the document in the 'info', 'html',
2775*c42dbd0eSchristosand 'pdf' formats.  The default installation procedure creates a file in
2776*c42dbd0eSchristosthe 'info' format and stores it in the documentation section of
2777*c42dbd0eSchristosbinutils.
2778*c42dbd0eSchristos
2779*c42dbd0eSchristos   The probably easiest way to generate a different format from this
2780*c42dbd0eSchristosTexinfo document is to go to the distribution directory that was created
2781*c42dbd0eSchristoswhen the tools were built.  This is either the default distribution
2782*c42dbd0eSchristosdirectory, or the one that has been set with the '--prefix' option as
2783*c42dbd0eSchristospart of the 'configure' command.  In this example we symbolize this
2784*c42dbd0eSchristoslocation with '<dist>'.
2785*c42dbd0eSchristos
2786*c42dbd0eSchristos   The make file called 'Makefile' in directory '<dist>/gprofng/doc'
2787*c42dbd0eSchristossupports several commands to generate this document in different
2788*c42dbd0eSchristosformats.  We recommend to use these commands.
2789*c42dbd0eSchristos
2790*c42dbd0eSchristos   They create the file(s) and install it in the documentation directory
2791*c42dbd0eSchristosof binutils, which is '<dist>/share/doc' in case 'html' or 'pdf' is
2792*c42dbd0eSchristosselected and '<dist>/share/info' for the file in the 'info' format.
2793*c42dbd0eSchristos
2794*c42dbd0eSchristos   To generate this document in the requested format and install it in
2795*c42dbd0eSchristosthe documentation directory, the commands below should be executed.  In
2796*c42dbd0eSchristosthis notation, '<format>' is one of 'info', 'html', or 'pdf':
2797*c42dbd0eSchristos
2798*c42dbd0eSchristos     $ cd <dist>/gprofng/doc
2799*c42dbd0eSchristos     $ make install-<format>
2800*c42dbd0eSchristos
2801*c42dbd0eSchristosSome things to note:
2802*c42dbd0eSchristos
2803*c42dbd0eSchristos   * For the 'pdf' file to be generated, the TeX document formatting
2804*c42dbd0eSchristos     software is required and the relevant commmands need to be included
2805*c42dbd0eSchristos     in the search path.  An example of a popular TeX implementation is
2806*c42dbd0eSchristos     _TexLive_.  It is beyond the scope of this document to go into the
2807*c42dbd0eSchristos     details of installing and using TeX, but it is well documented
2808*c42dbd0eSchristos     elsewhere.
2809*c42dbd0eSchristos
2810*c42dbd0eSchristos   * Instead of generating a single file in the 'html' format, it is
2811*c42dbd0eSchristos     also possible to create a directory with individual files for the
2812*c42dbd0eSchristos     various chapters.  To do so, remove the use of '--no-split' in
2813*c42dbd0eSchristos     variable 'MAKEINFOHTML' in the make file in the 'doc' directory.
2814*c42dbd0eSchristos
2815*c42dbd0eSchristos   * The make file also supports commands to only generate the file in
2816*c42dbd0eSchristos     the desired format and not move them to the documentation
2817*c42dbd0eSchristos     directory.  This is accomplished through the 'make <format>'
2818*c42dbd0eSchristos     command.
2819*c42dbd0eSchristos
2820*c42dbd0eSchristos
2821*c42dbd0eSchristosFile: gprofng.info,  Node: Index,  Prev: Other Document Formats,  Up: Top
2822*c42dbd0eSchristos
2823*c42dbd0eSchristosIndex
2824*c42dbd0eSchristos*****
2825*c42dbd0eSchristos
2826*c42dbd0eSchristos�[index�]
2827*c42dbd0eSchristos* Menu:
2828*c42dbd0eSchristos
2829*c42dbd0eSchristos* Command line mode:                     A First Profile.     (line  39)
2830*c42dbd0eSchristos* Commands, calltree:                    The Call Tree.       (line  10)
2831*c42dbd0eSchristos* Commands, compare delta:               Comparison of Experiments.
2832*c42dbd0eSchristos                                                              (line  52)
2833*c42dbd0eSchristos* Commands, compare on/off:              Comparison of Experiments.
2834*c42dbd0eSchristos                                                              (line   7)
2835*c42dbd0eSchristos* Commands, compare on/off <1>:          Comparison of Experiments.
2836*c42dbd0eSchristos                                                              (line  51)
2837*c42dbd0eSchristos* Commands, compare ratio:               Comparison of Experiments.
2838*c42dbd0eSchristos                                                              (line  52)
2839*c42dbd0eSchristos* Commands, compare ratio <1>:           Examples Using Hardware Event Counters.
2840*c42dbd0eSchristos                                                              (line 166)
2841*c42dbd0eSchristos* Commands, cpus:                        Commands Specific to Multithreading.
2842*c42dbd0eSchristos                                                              (line 177)
2843*c42dbd0eSchristos* Commands, cpu_list:                    Commands Specific to Multithreading.
2844*c42dbd0eSchristos                                                              (line 176)
2845*c42dbd0eSchristos* Commands, disasm:                      The Disassembly View.
2846*c42dbd0eSchristos                                                              (line   8)
2847*c42dbd0eSchristos* Commands, experiment_list:             Aggregation of Experiments.
2848*c42dbd0eSchristos                                                              (line  21)
2849*c42dbd0eSchristos* Commands, fsingle:                     Information on Load Objects.
2850*c42dbd0eSchristos                                                              (line  10)
2851*c42dbd0eSchristos* Commands, fsingle <1>:                 Information on Load Objects.
2852*c42dbd0eSchristos                                                              (line  36)
2853*c42dbd0eSchristos* Commands, fsummary:                    Information on Load Objects.
2854*c42dbd0eSchristos                                                              (line  10)
2855*c42dbd0eSchristos* Commands, fsummary <1>:                Information on Load Objects.
2856*c42dbd0eSchristos                                                              (line  62)
2857*c42dbd0eSchristos* Commands, functions:                   A First Profile.     (line  44)
2858*c42dbd0eSchristos* Commands, header:                      More Information on the Experiment.
2859*c42dbd0eSchristos                                                              (line  10)
2860*c42dbd0eSchristos* Commands, header <1>:                  Control the Sampling Frequency.
2861*c42dbd0eSchristos                                                              (line  59)
2862*c42dbd0eSchristos* Commands, limit:                       Control the Number of Lines in the Output.
2863*c42dbd0eSchristos                                                              (line   6)
2864*c42dbd0eSchristos* Commands, lines:                       The Source Code View.
2865*c42dbd0eSchristos                                                              (line  68)
2866*c42dbd0eSchristos* Commands, metrics:                     Display and Define the Metrics.
2867*c42dbd0eSchristos                                                              (line  11)
2868*c42dbd0eSchristos* Commands, metrics <1>:                 Display and Define the Metrics.
2869*c42dbd0eSchristos                                                              (line  38)
2870*c42dbd0eSchristos* Commands, metrics <2>:                 Metric Definitions.  (line   9)
2871*c42dbd0eSchristos* Commands, metric_list:                 Display and Define the Metrics.
2872*c42dbd0eSchristos                                                              (line  10)
2873*c42dbd0eSchristos* Commands, metric_list <1>:             Display and Define the Metrics.
2874*c42dbd0eSchristos                                                              (line  19)
2875*c42dbd0eSchristos* Commands, metric_list <2>:             Examples Using Hardware Event Counters.
2876*c42dbd0eSchristos                                                              (line 261)
2877*c42dbd0eSchristos* Commands, objects:                     Information on Load Objects.
2878*c42dbd0eSchristos                                                              (line  10)
2879*c42dbd0eSchristos* Commands, overview:                    More Information on the Experiment.
2880*c42dbd0eSchristos                                                              (line  54)
2881*c42dbd0eSchristos* Commands, pcs:                         The Disassembly View.
2882*c42dbd0eSchristos                                                              (line  73)
2883*c42dbd0eSchristos* Commands, script:                      Scripting.           (line  11)
2884*c42dbd0eSchristos* Commands, sort:                        Sorting the Performance Data.
2885*c42dbd0eSchristos                                                              (line   6)
2886*c42dbd0eSchristos* Commands, source:                      The Source Code View.
2887*c42dbd0eSchristos                                                              (line  12)
2888*c42dbd0eSchristos* Commands, threads:                     Commands Specific to Multithreading.
2889*c42dbd0eSchristos                                                              (line  35)
2890*c42dbd0eSchristos* Commands, thread_list:                 Commands Specific to Multithreading.
2891*c42dbd0eSchristos                                                              (line  10)
2892*c42dbd0eSchristos* Commands, thread_select:               Commands Specific to Multithreading.
2893*c42dbd0eSchristos                                                              (line  64)
2894*c42dbd0eSchristos* Commands, thread_select <1>:           Commands Specific to Multithreading.
2895*c42dbd0eSchristos                                                              (line  88)
2896*c42dbd0eSchristos* Commands, viewmode:                    Java Profiling.      (line  37)
2897*c42dbd0eSchristos* Commands, viewmode <1>:                The Viewmode.        (line   6)
2898*c42dbd0eSchristos* Compare experiments:                   Comparison of Experiments.
2899*c42dbd0eSchristos                                                              (line  10)
2900*c42dbd0eSchristos* CPI:                                   Hardware Event Counters Explained.
2901*c42dbd0eSchristos                                                              (line  31)
2902*c42dbd0eSchristos* CPU:                                   The Concept of a CPU in gprofng.
2903*c42dbd0eSchristos                                                              (line   6)
2904*c42dbd0eSchristos* Default metrics:                       Display and Define the Metrics.
2905*c42dbd0eSchristos                                                              (line  34)
2906*c42dbd0eSchristos* ELF:                                   Load Objects and Functions.
2907*c42dbd0eSchristos                                                              (line  16)
2908*c42dbd0eSchristos* Exclusive metric:                      Inclusive and Exclusive Metrics.
2909*c42dbd0eSchristos                                                              (line  14)
2910*c42dbd0eSchristos* Experiment directory:                  Steps Needed to Create a Profile.
2911*c42dbd0eSchristos                                                              (line  21)
2912*c42dbd0eSchristos* Filters, Thread selection:             Commands Specific to Multithreading.
2913*c42dbd0eSchristos                                                              (line  64)
2914*c42dbd0eSchristos* Flavor field:                          Metric Definitions.  (line  13)
2915*c42dbd0eSchristos* Function:                              Load Objects and Functions.
2916*c42dbd0eSchristos                                                              (line  26)
2917*c42dbd0eSchristos* gprofng display html:                  Steps Needed to Create a Profile.
2918*c42dbd0eSchristos                                                              (line  46)
2919*c42dbd0eSchristos* gprofng display text:                  Steps Needed to Create a Profile.
2920*c42dbd0eSchristos                                                              (line  42)
2921*c42dbd0eSchristos* Hardware event counters, alias name:   Hardware Event Counters Explained.
2922*c42dbd0eSchristos                                                              (line  57)
2923*c42dbd0eSchristos* Hardware event counters, auto option:  Examples Using Hardware Event Counters.
2924*c42dbd0eSchristos                                                              (line  21)
2925*c42dbd0eSchristos* Hardware event counters, counter definition: Getting Information on the Counters Supported.
2926*c42dbd0eSchristos                                                              (line  85)
2927*c42dbd0eSchristos* Hardware event counters, description:  Hardware Event Counters Explained.
2928*c42dbd0eSchristos                                                              (line   6)
2929*c42dbd0eSchristos* Hardware event counters, hwc metric:   Examples Using Hardware Event Counters.
2930*c42dbd0eSchristos                                                              (line  31)
2931*c42dbd0eSchristos* Hardware event counters, IPC:          Examples Using Hardware Event Counters.
2932*c42dbd0eSchristos                                                              (line 250)
2933*c42dbd0eSchristos* Hardware event counters, variable CPU frequency: Hardware Event Counters Explained.
2934*c42dbd0eSchristos                                                              (line  40)
2935*c42dbd0eSchristos* Inclusive metric:                      Inclusive and Exclusive Metrics.
2936*c42dbd0eSchristos                                                              (line   9)
2937*c42dbd0eSchristos* Instruction level timings:             The Disassembly View.
2938*c42dbd0eSchristos                                                              (line   9)
2939*c42dbd0eSchristos* Instruction pointer:                   The Program Counter. (line  10)
2940*c42dbd0eSchristos* Interpreter mode:                      A First Profile.     (line  30)
2941*c42dbd0eSchristos* IPC:                                   Hardware Event Counters Explained.
2942*c42dbd0eSchristos                                                              (line  20)
2943*c42dbd0eSchristos* Java profiling, -J <string>:           Java Profiling.      (line  29)
2944*c42dbd0eSchristos* Java profiling, -j on/off:             Java Profiling.      (line   6)
2945*c42dbd0eSchristos* Java profiling, <JVM-System>:          The Viewmode.        (line  15)
2946*c42dbd0eSchristos* Java profiling, <no Java callstack recorded>: The Viewmode. (line  18)
2947*c42dbd0eSchristos* Java profiling, different view modes:  Java Profiling.      (line  37)
2948*c42dbd0eSchristos* Java profiling, JAVA_PATH:             Java Profiling.      (line  24)
2949*c42dbd0eSchristos* Java profiling, JDK_HOME:              Java Profiling.      (line  23)
2950*c42dbd0eSchristos* Leaf function:                         Inclusive and Exclusive Metrics.
2951*c42dbd0eSchristos                                                              (line  19)
2952*c42dbd0eSchristos* List specification:                    The Selection List.  (line   6)
2953*c42dbd0eSchristos* Load object:                           Load Objects and Functions.
2954*c42dbd0eSchristos                                                              (line  11)
2955*c42dbd0eSchristos* Load objects:                          Information on Load Objects.
2956*c42dbd0eSchristos                                                              (line  11)
2957*c42dbd0eSchristos* Metric name field:                     Metric Definitions.  (line  13)
2958*c42dbd0eSchristos* Miscellaneous , ##:                    The Source Code View.
2959*c42dbd0eSchristos                                                              (line  63)
2960*c42dbd0eSchristos* Miscellaneous, <apath>:                Information on Load Objects.
2961*c42dbd0eSchristos                                                              (line  16)
2962*c42dbd0eSchristos* Miscellaneous, <Total>:                A First Profile.     (line  86)
2963*c42dbd0eSchristos* mxv-pthreads.exe:                      The Example Program. (line  12)
2964*c42dbd0eSchristos* Options, -C:                           More Information on the Experiment.
2965*c42dbd0eSchristos                                                              (line  48)
2966*c42dbd0eSchristos* Options, -h:                           Getting Information on the Counters Supported.
2967*c42dbd0eSchristos                                                              (line   9)
2968*c42dbd0eSchristos* Options, -h <1>:                       Examples Using Hardware Event Counters.
2969*c42dbd0eSchristos                                                              (line  21)
2970*c42dbd0eSchristos* Options, -o:                           Name the Experiment Directory.
2971*c42dbd0eSchristos                                                              (line  15)
2972*c42dbd0eSchristos* Options, -O:                           Name the Experiment Directory.
2973*c42dbd0eSchristos                                                              (line  19)
2974*c42dbd0eSchristos* Options, -O <1>:                       A More Elaborate Example.
2975*c42dbd0eSchristos                                                              (line  12)
2976*c42dbd0eSchristos* Options, -p:                           The Call Tree.       (line  66)
2977*c42dbd0eSchristos* Options, -p <1>:                       Control the Sampling Frequency.
2978*c42dbd0eSchristos                                                              (line  18)
2979*c42dbd0eSchristos* PC:                                    The Program Counter. (line   6)
2980*c42dbd0eSchristos* PC <1>:                                Load Objects and Functions.
2981*c42dbd0eSchristos                                                              (line  32)
2982*c42dbd0eSchristos* PC sampling:                           Sampling versus Tracing.
2983*c42dbd0eSchristos                                                              (line   7)
2984*c42dbd0eSchristos* Posix Threads:                         The Example Program. (line   8)
2985*c42dbd0eSchristos* Program Counter:                       The Program Counter. (line   6)
2986*c42dbd0eSchristos* Program Counter <1>:                   Load Objects and Functions.
2987*c42dbd0eSchristos                                                              (line  32)
2988*c42dbd0eSchristos* Program Counter sampling:              Sampling versus Tracing.
2989*c42dbd0eSchristos                                                              (line   7)
2990*c42dbd0eSchristos* Pthreads:                              The Example Program. (line   8)
2991*c42dbd0eSchristos* Sampling interval:                     Control the Sampling Frequency.
2992*c42dbd0eSchristos                                                              (line  20)
2993*c42dbd0eSchristos* Selection list:                        The Selection List.  (line   6)
2994*c42dbd0eSchristos* Source level timings:                  The Source Code View.
2995*c42dbd0eSchristos                                                              (line  10)
2996*c42dbd0eSchristos* TeX:                                   Other Document Formats.
2997*c42dbd0eSchristos                                                              (line  40)
2998*c42dbd0eSchristos* Thread affinity:                       Commands Specific to Multithreading.
2999*c42dbd0eSchristos                                                              (line 162)
3000*c42dbd0eSchristos* Total CPU time:                        A First Profile.     (line  74)
3001*c42dbd0eSchristos* Viewmode:                              The Viewmode.        (line   6)
3002*c42dbd0eSchristos* Visibility field:                      Sorting the Performance Data.
3003*c42dbd0eSchristos                                                              (line   9)
3004*c42dbd0eSchristos* Visibility field <1>:                  Metric Definitions.  (line  13)
3005*c42dbd0eSchristos
3006*c42dbd0eSchristos
3007*c42dbd0eSchristos
3008*c42dbd0eSchristosTag Table:
3009*c42dbd0eSchristosNode: Top750
3010*c42dbd0eSchristosNode: Introduction3075
3011*c42dbd0eSchristosNode: Overview4352
3012*c42dbd0eSchristosNode: Main Features4993
3013*c42dbd0eSchristosNode: Sampling versus Tracing6733
3014*c42dbd0eSchristosNode: Steps Needed to Create a Profile9124
3015*c42dbd0eSchristosNode: A Mini Tutorial11311
3016*c42dbd0eSchristosNode: Getting Started12030
3017*c42dbd0eSchristosNode: The Example Program13680
3018*c42dbd0eSchristosNode: A First Profile14800
3019*c42dbd0eSchristosNode: The Source Code View19305
3020*c42dbd0eSchristosNode: The Disassembly View23801
3021*c42dbd0eSchristosNode: Display and Define the Metrics28490
3022*c42dbd0eSchristosNode: A First Customization of the Output30326
3023*c42dbd0eSchristosNode: Name the Experiment Directory32175
3024*c42dbd0eSchristosNode: Control the Number of Lines in the Output33272
3025*c42dbd0eSchristosNode: Sorting the Performance Data34031
3026*c42dbd0eSchristosNode: Scripting34973
3027*c42dbd0eSchristosNode: A More Elaborate Example35892
3028*c42dbd0eSchristosNode: The Call Tree38682
3029*c42dbd0eSchristosNode: More Information on the Experiment41423
3030*c42dbd0eSchristosNode: Control the Sampling Frequency44722
3031*c42dbd0eSchristosNode: Information on Load Objects47143
3032*c42dbd0eSchristosNode: Support for Multithreading50782
3033*c42dbd0eSchristosNode: Creating a Multithreading Experiment51422
3034*c42dbd0eSchristosNode: Commands Specific to Multithreading52935
3035*c42dbd0eSchristosNode: Viewing Multiple Experiments62341
3036*c42dbd0eSchristosNode: Aggregation of Experiments62918
3037*c42dbd0eSchristosNode: Comparison of Experiments65232
3038*c42dbd0eSchristosNode: Profile Hardware Event Counters69848
3039*c42dbd0eSchristosNode: Getting Information on the Counters Supported70556
3040*c42dbd0eSchristosNode: Examples Using Hardware Event Counters78156
3041*c42dbd0eSchristosNode: Java Profiling95853
3042*c42dbd0eSchristosNode: Terminology98222
3043*c42dbd0eSchristosNode: The Program Counter99234
3044*c42dbd0eSchristosNode: Inclusive and Exclusive Metrics99726
3045*c42dbd0eSchristosNode: Metric Definitions101179
3046*c42dbd0eSchristosNode: The Viewmode102364
3047*c42dbd0eSchristosNode: The Selection List103503
3048*c42dbd0eSchristosNode: Load Objects and Functions104905
3049*c42dbd0eSchristosNode: The Concept of a CPU in gprofng106919
3050*c42dbd0eSchristosNode: Hardware Event Counters Explained107680
3051*c42dbd0eSchristosNode: apath111472
3052*c42dbd0eSchristosNode: Other Document Formats112007
3053*c42dbd0eSchristosNode: Index114531
3054*c42dbd0eSchristos
3055*c42dbd0eSchristosEnd Tag Table
3056*c42dbd0eSchristos
3057*c42dbd0eSchristos
3058*c42dbd0eSchristosLocal Variables:
3059*c42dbd0eSchristoscoding: utf-8
3060*c42dbd0eSchristosEnd:
3061