This is gprofng.info, produced by makeinfo version 6.8 from
gprofng.texi.

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* gprofng: (gprofng).                    The next generation profiling tool for Linux
END-INFO-DIR-ENTRY


File: gprofng.info,  Node: Top,  Next: Introduction,  Up: (dir)

GNU Gprofng
***********

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

* Menu:

* Introduction::           About this manual.
* Overview::               A brief overview of gprofng.
* A Mini Tutorial::        A short tutorial covering the key features.
* Terminology::            Various concepts and some terminology explained.
* Other Document Formats:: How to create this document in other formats.
* Index::                  The index.


-- The Detailed Node Listing --

Introduction

Overview

* Main Features::                     A high level overview.
* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::  How to create a profile.

A Mini Tutorial

* Getting Started::                 The basics of profiling with gprofng.
* Support for Multithreading::      Commands specific to multithreaded applications.
* Viewing Multiple Experiments::    Analyze multiple experiments.
* Profile Hardware Event Counters:: How to use hardware event counters.
* Java Profiling::                  How to profile a Java application.

Terminology

* The Program Counter::                    What is a Program Counter?
* Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
* Metric Definitions::                     Definitions associated with metrics.
* The Viewmode::                           Select the way call stacks are presented.
* The Selection List::                     How to define a selection.
* Load Objects and Functions::             The components in an application.
* The Concept of a CPU in gprofng::        The definition of a CPU.
* Hardware Event Counters Explained::      What are event counters?
* apath::                                  Our generic definition of a path.




File: gprofng.info,  Node: Introduction,  Next: Overview,  Prev: Top,  Up: Top

1 Introduction
**************

The gprofng tool is the next generation profiler for Linux.  It consists
of various commands to generate and display profile information.

   This manual starts with a tutorial on how to create and interpret a
profile.  This part is highly practical and aims to get users up to
speed as quickly as possible.  We would like to show you how to get your
first profile on your screen as soon as possible.

   This is followed by more examples, covering many of the features.  At
the end of this tutorial, you should feel confident enough to tackle the
more complex tasks.

   In a future update a more formal reference manual will be included as
well.  Since even in this tutorial we use certain terminology, we have
included a chapter with descriptions at the end.  In case you encounter
unfamiliar terms, please check this chapter.

   One word of caution.  In several cases we had to tweak the screen
output somewhat in order to make it fit.  This is why the output may
look slightly different when you try things yourself.

   For now, we wish you a smooth profiling experience with gprofng and
good luck tackling performance bottlenecks.


File: gprofng.info,  Node: Overview,  Next: A Mini Tutorial,  Prev: Introduction,  Up: Top

2 A Brief Overview of gprofng
*****************************

* Menu:

* Main Features::                     A high level overview.
* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::  How to create a profile.

Before we cover this tool in detail, we start with a brief overview of
what it is and its main features.  Since we know that many of you would
like to get started right away, we explain the basics of profiling with
'gprofng' already in this first chapter.


File: gprofng.info,  Node: Main Features,  Next: Sampling versus Tracing,  Up: Overview

2.1 Main Features
=================

These are the main features of the gprofng tool:

   * Profiling is supported for an application written in C, C++, Java,
     or Scala.

   * Shared libraries are supported.  The information is presented at
     the instruction level.

   * The following multithreading programming models are supported:
     Pthreads, OpenMP, and Java threads.

   * This tool works with unmodified production level executables.
     There is no need to recompile the code, but if the '-g' option has
     been used when building the application, source line level
     information is available.

   * The focus is on support for code generated with the 'gcc' compiler,
     but there is some limited support for the 'icc' compiler as well.
     Future improvements and enhancements will focus on 'gcc' though.

   * Processors from Intel, AMD, and Arm are supported, but the level of
     support depends on the architectural details.  In particular,
     hardware event counters may not be supported.

   * Several views into the data are supported.  For example, a function
     overview shows where the time is spent, but source line,
     disassembly, call tree, and caller-callees overviews are available
     as well.

   * Through filters, the user can zoom in on an area of interest.

   * Two or more profiles can be aggregated, or used in a comparison.
     This comparison can be obtained at the function, source line, and
     disassembly level.

   * Through a scripting language, and customization of the metrics
     shown, the generation and creation of a profile can be fully
     automated and provide tailored output.


File: gprofng.info,  Node: Sampling versus Tracing,  Next: Steps Needed to Create a Profile,  Prev: Main Features,  Up: Overview

2.2 Sampling versus Tracing
===========================

A key difference with some other profiling tools is that the main data
collection command 'gprofng collect app' mostly uses Program Counter
(PC) sampling under the hood.

   With _sampling_, the executable is stopped at regular intervals.
Each time it is halted, key information is gathered and stored.  This
includes the Program Counter that keeps track of where the execution is.
Hence the name.

   Together with operational data, this information is stored in the
experiment directory and can be viewed in the second phase.

   For example, the PC information is used to derive where the program
was when it was halted.  Since the sampling interval is known, it is
relatively easy to derive how much time was spent in the various parts
of the program.
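
The arithmetic behind this estimate can be sketched in a few lines.  The
sampling interval, the sample data, and the way samples are represented
below are made up for illustration only:

```python
from collections import Counter

def estimate_cpu_times(sampled_functions, interval_sec):
    """Attribute one sampling interval to the function in which each
    PC sample landed; the per-function totals estimate CPU time."""
    counts = Counter(sampled_functions)
    return {func: n * interval_sec for func, n in counts.items()}

# Ten illustrative samples taken at a 10 ms interval: eight landed in
# one function, two in another.
samples = ["hot_function"] * 8 + ["setup_function"] * 2
print(estimate_cpu_times(samples, 0.010))
```

Each sample is simply worth one interval, so a function that received
80% of the samples is estimated to have used 80% of the CPU time.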

   The opposite technique is generally referred to as _tracing_.  With
tracing, the target is instrumented with specific calls that collect the
requested information.

   These are some of the pros and cons of PC sampling versus tracing:

   * Since there is no need to recompile, existing executables can be
     used and the profile measures the behaviour of exactly the same
     executable that is used in production runs.

     With tracing, one inherently profiles a different executable
     because the calls to the instrumentation library may affect the
     compiler optimizations and run time behaviour.

   * With sampling, there are very few restrictions on what can be
     profiled and even without access to the source code, a basic
     profile can be made.

   * A downside of sampling is that, depending on the sampling
     frequency, small functions may be missed or not captured
     accurately.  Although this is rare, this may happen and is the
     reason why the user has control over the sampling rate.

   * While tracing produces precise information, sampling is statistical
     in nature.  As a result, small variations may occur across
     seemingly identical runs.  We have not observed more than a few
     percent deviation though, especially if the target job executed for
     a sufficiently long time.

   * With sampling, it is not possible to get an accurate count of how
     often functions are called.


File: gprofng.info,  Node: Steps Needed to Create a Profile,  Prev: Sampling versus Tracing,  Up: Overview

2.3 Steps Needed to Create a Profile
====================================

Creating a profile takes two steps.  First the profile data needs to be
generated.  This is followed by a viewing step to create a report from
the information that has been gathered.

   Every gprofng command starts with 'gprofng', the name of the driver.
This is followed by a keyword to define the high level functionality.
Depending on this keyword, a third qualifier may be needed to further
narrow down the request.  This combination is then followed by options
that are specific to the functionality desired.

   The command to gather, or "collect", the performance data is called
'gprofng collect app'.  Aside from numerous options, this command takes
the name of the target executable as an input parameter.

   Upon completion of the run, the performance data can be found in the
newly created experiment directory.

   Unless explicitly specified otherwise, a default name for this
directory is chosen.  The name is 'test.<n>.er', where 'n' is the first
integer not yet in use for such a name.

   For example, the first time 'gprofng collect app' is invoked, an
experiment directory with the name 'test.1.er' is created.

   Upon a subsequent invocation of 'gprofng collect app' in the same
directory, an experiment directory with the name 'test.2.er' will be
created, and so forth.

   Note that 'gprofng collect app' supports an option to explicitly name
the experiment directory.  Outside of the restriction that the name of
this directory has to end with '.er', any valid directory name can be
used for this.
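
The default naming rule described above can be sketched as follows.
This is an illustration of the rule only, not gprofng's actual
implementation:

```python
def default_experiment_name(existing_names):
    """Return 'test.<n>.er', where n is the first integer not yet in
    use for such a name in the current directory."""
    n = 1
    while f"test.{n}.er" in existing_names:
        n += 1
    return f"test.{n}.er"

# First invocation: no experiment directories exist yet.
print(default_experiment_name(set()))
# Third invocation: test.1.er and test.2.er are already present.
print(default_experiment_name({"test.1.er", "test.2.er"}))
```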

   Now that we have the performance data, the next step is to display
it.

   The most commonly used command to view the performance information is
'gprofng display text'.  This is a very extensive and customizable tool
that produces the information in ASCII format.
270
271   Another option is to use 'gprofng display html'.  This tool generates
272a directory with files in html format.  These can be viewed in a
273browser, allowing for easy navigation through the profile data.


File: gprofng.info,  Node: A Mini Tutorial,  Next: Terminology,  Prev: Overview,  Up: Top

3 A Mini Tutorial
*****************

In this chapter we present and discuss the main functionality of
'gprofng'.  This will be a practical approach, using an example code to
generate profile data and show how to get various performance reports.

* Menu:

* Getting Started::                 The basics of profiling with gprofng.
* Support for Multithreading::      Commands specific to multithreaded applications.
* Viewing Multiple Experiments::    Analyze multiple experiments.
* Profile Hardware Event Counters:: How to use hardware event counters.
* Java Profiling::                  How to profile a Java application.


File: gprofng.info,  Node: Getting Started,  Next: Support for Multithreading,  Up: A Mini Tutorial

3.1 Getting Started
===================

The information presented here provides a good and common basis for many
profiling tasks, but there are more features that you may want to
leverage.

   These are covered in subsequent sections in this chapter.

* Menu:

* The Example Program::                        A description of the example program used.
* A First Profile::                            How to get the first profile.
* The Source Code View::                       Display the metrics in the source code.
* The Disassembly View::                       Display the metrics at the instruction level.
* Display and Define the Metrics::             An example of how to customize the metrics.
* A First Customization of the Output::        An example of how to customize the output.
* Name the Experiment Directory::              Change the name of the experiment directory.
* Control the Number of Lines in the Output::  Change the number of lines in the tables.
* Sorting the Performance Data::               How to set the metric to sort by.
* Scripting::                                  Use a script to execute the commands.
* A More Elaborate Example::                   An example of customization.
* The Call Tree::                              Display the dynamic call tree.
* More Information on the Experiment::         How to get additional statistics.
* Control the Sampling Frequency::             How to control the sampling granularity.
* Information on Load Objects::                How to get more information on load objects.


File: gprofng.info,  Node: The Example Program,  Next: A First Profile,  Up: Getting Started

3.1.1 The Example Program
-------------------------

Throughout this guide we use the same example C code that implements the
multiplication of a vector of length n by an m by n matrix.  The result
is stored in a vector of length m.  The algorithm has been parallelized
using Posix Threads, or Pthreads for short.

   The code was built using the 'gcc' compiler and the name of the
executable is 'mxv-pthreads.exe'.

   The matrix sizes can be set through the '-m' and '-n' options.  The
number of threads is set with the '-t' option.  To increase the duration
of the run, the multiplication is executed repeatedly.

   This is an example that multiplies a 3000 by 2000 matrix with a
vector of length 2000 using 2 threads:

     $ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
     mxv: error check passed - rows = 3000 columns = 2000 threads = 2
     $

   The program performs an internal check to verify the results are
correct.  The result of this check is printed, followed by the matrix
sizes and the number of threads used.
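
The computation performed by the example program can be sketched in a
few lines.  This is a plain rendering of the algorithm only, not the
actual C code:

```python
def mxv(A, b):
    """Multiply an m x n matrix A by a vector b of length n; the
    result is a vector of length m."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, b)) for row in A]

A = [[1, 2], [3, 4], [5, 6]]   # m = 3 rows, n = 2 columns
b = [10, 20]                   # vector of length n = 2
print(mxv(A, b))               # vector of length m = 3
```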


File: gprofng.info,  Node: A First Profile,  Next: The Source Code View,  Prev: The Example Program,  Up: Getting Started

3.1.2 A First Profile
---------------------

The first step is to collect the performance data.  It is important to
remember that much more information is gathered than may be shown by
default.  Often a single data collection run is sufficient to get a lot
of insight.

   The 'gprofng collect app' command is used for the data collection.
Nothing needs to be changed in the way the application is executed.  The
only difference is that it is now run under control of the tool, as
shown below:

     $ gprofng collect app ./mxv-pthreads.exe -m 3000 -n 2000 -t 1

   This command produces the following output:

     Creating experiment database test.1.er (Process ID: 2416504) ...
     mxv: error check passed - rows = 3000 columns = 2000 threads = 1

   We see the message that a directory with the name 'test.1.er' has
been created.  The application then completes as usual and we have our
first experiment directory that can be analyzed.

   The tool we use for this is called 'gprofng display text'.  It takes
the name of the experiment directory as an argument.

   If invoked this way, the tool starts in the interactive _interpreter_
mode.  While in this environment, commands can be given and the tool
responds.  This is illustrated below:

     $ gprofng display text test.1.er
     Warning: History and command editing is not supported on this system.
     (gp-display-text) quit
     $

   While useful in certain cases, we prefer to use this tool in command
line mode, by specifying the commands to be issued when invoking the
tool.  The way to do this is to prepend the command with a hyphen ('-')
if used on the command line.

   For example, with the 'functions' command we request a list of the
functions that have been executed and their respective CPU times:

     $ gprofng display text -functions test.1.er

     Functions sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     2.160     2.160      mxv_core
     0.047     0.103      init_data
     0.030     0.043      erand48_r
     0.013     0.013      __drand48_iterate
     0.013     0.056      drand48
     0.008     0.010      _int_malloc
     0.001     0.001      brk
     0.001     0.002      sysmalloc
     0.        0.001      __default_morecore
     0.        0.113      __libc_start_main
     0.        0.010      allocate_data
     0.        2.160      collector_root
     0.        2.160      driver_mxv
     0.        0.113      main
     0.        0.010      malloc
     0.        0.001      sbrk

   As easy and simple as these steps are, we do have a first profile of
our program!  The output shows three columns: the exclusive and
inclusive _Total CPU Time_, followed by the name of the function.  The
Total CPU Time is the sum of the user and system time.  *Note Inclusive
and Exclusive Metrics:: for an explanation of "exclusive" and
"inclusive" times.

   The first line echoes the metric that is used to sort the output.  By
default, this is the exclusive CPU time, but the sort metric can be
changed by the user.

   The function with the name '<Total>' is not a user function, but is
introduced by 'gprofng' and is used to display the accumulated metric
values.  In this case, we see that the total CPU time of this job was
'2.272' seconds.

   With '2.160' seconds, function 'mxv_core' is the most time consuming
function.  It is also a leaf function.
445
446   The next function in the list is 'init_data'.  Although the CPU time
447spent in this part is negligible, this is an interesting entry because
448the inclusive CPU time of '0.103' seconds is higher than the exclusive
449CPU time of '0.047' seconds.  Clearly it is calling another function, or
450even more than one function.  *Note The Call Tree:: for the details how
451to get more information on this.
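
The relationship between the two metrics can be expressed as follows:
the inclusive time of a function is its own exclusive time plus the
inclusive times of its callees.  The sketch below uses the exclusive
times from the function overview above, but the call chain itself is a
plausible reconstruction for illustration only:

```python
def inclusive_time(func, exclusive, callees):
    """Inclusive time = exclusive time + inclusive times of callees."""
    return exclusive[func] + sum(inclusive_time(c, exclusive, callees)
                                 for c in callees.get(func, []))

# Exclusive times from the function overview; the chain
# init_data -> drand48 -> erand48_r -> __drand48_iterate is assumed.
exclusive = {"init_data": 0.047, "drand48": 0.013,
             "erand48_r": 0.030, "__drand48_iterate": 0.013}
callees = {"init_data": ["drand48"], "drand48": ["erand48_r"],
           "erand48_r": ["__drand48_iterate"]}
print(round(inclusive_time("init_data", exclusive, callees), 3))
```

With this assumed chain, the computed inclusive times match the table:
0.103 seconds for 'init_data' and 0.056 seconds for 'drand48'.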

   The function 'collector_root' does not look familiar.  It is one of
the internal functions used by 'gprofng collect app' and can be ignored.
While the inclusive time is high, the exclusive time is zero.  This
means it does not consume CPU time itself.

   The question is how to find out where this function originates from.
There is a very useful command to get more details on a function.  *Note
Information on Load Objects::.


File: gprofng.info,  Node: The Source Code View,  Next: The Disassembly View,  Prev: A First Profile,  Up: Getting Started

3.1.3 The Source Code View
--------------------------

In general, you would like to focus the tuning efforts on the most time
consuming part(s) of the program.  In this case that is easy, since
2.160 seconds out of a total of 2.272 seconds is spent in function
'mxv_core'.  That is 95% of the total and it is time to dig deeper and
look at the time distribution at the source code level.

   The 'source' command is used to accomplish this.  It takes the name
of the function, not the source filename, as an argument.  This is
demonstrated below, where the 'gprofng display text' command is used to
show the annotated source listing of function 'mxv_core'.

   Please note that the source code has to be compiled with the '-g'
option in order for the source code feature to work.  Otherwise the
location cannot be determined.

     $ gprofng display text -source mxv_core test.1.er

   The slightly modified output is as follows:

     Source file: <apath>/mxv.c
     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

        <lines deleted>
                                    <Function: mxv_core>
        0.        0.             32. void __attribute__ ((noinline))
                                     mxv_core (
                                     uint64_t row_index_start,
                                     uint64_t row_index_end,
                                     uint64_t m, uint64_t n,
                                     double **restrict A,
                                     double *restrict b,
                                     double *restrict c)
        0.        0.             33. {
        0.        0.             34.    for (uint64_t i=row_index_start;
                                             i<=row_index_end; i++) {
        0.        0.             35.       double row_sum = 0.0;
     ## 1.687     1.687          36.       for (int64_t j=0; j<n; j++)
        0.473     0.473          37.          row_sum += A[i][j]*b[j];
        0.        0.             38.       c[i] = row_sum;
                                 39.    }
        0.        0.             40. }

   The first three lines provide information on the location of the
source file, the object file and the load object (*Note Load Objects and
Functions::).

   Function 'mxv_core' is part of a source file that has other functions
as well.  These functions will be shown, but without timing information.
They have been removed in the output shown above.

   This is followed by the annotated source code listing.  The selected
metrics are shown first, followed by a source line number, and the
source code.  The most time consuming line(s) are marked with the '##'
symbol.  In this way they are easier to find.
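
The '##' marker simply flags the line(s) carrying the highest metric
value.  That logic can be sketched as follows; this is an illustration
only, not gprofng's implementation:

```python
def annotate(lines):
    """Prefix the line(s) with the maximum metric value with '##'."""
    hottest = max(time for time, _ in lines)
    return [("## " if time == hottest else "   ") + text
            for time, text in lines]

# (metric value, source text) pairs taken from the listing above.
listing = [(0.0, "double row_sum = 0.0;"),
           (1.687, "for (int64_t j=0; j<n; j++)"),
           (0.473, "row_sum += A[i][j]*b[j];")]
for line in annotate(listing):
    print(line)
```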

   What we see is that all of the time is spent in lines 36-37.

   A related command sometimes comes in handy as well.  It is called
'lines' and displays a list of the source lines and their metrics,
ordered according to the current sort metric (*Note Sorting the
Performance Data::).

   Below are the command and the output.  For layout reasons, only the
top 10 lines are shown here, and the last part of the text on some lines
has been replaced by dots.

     $ gprofng display text -lines test.1.er

     Lines sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.  Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272  <Total>
     1.687     1.687  mxv_core, line 36 in "mxv.c"
     0.473     0.473  mxv_core, line 37 in "mxv.c"
     0.032     0.088  init_data, line 72 in "manage_data.c"
     0.030     0.043  <Function: erand48_r, instructions without line numbers>
     0.013     0.013  <Function: __drand48_iterate, instructions without ...>
     0.013     0.056  <Function: drand48, instructions without line numbers>
     0.012     0.012  init_data, line 77 in "manage_data.c"
     0.008     0.010  <Function: _int_malloc, instructions without ...>
     0.003     0.003  init_data, line 71 in "manage_data.c"

   What this overview immediately highlights is that the next most time
consuming source line takes only 0.032 seconds.  With an inclusive time
of 0.088 seconds, it is also clear that this branch of the code does not
impact the performance.


File: gprofng.info,  Node: The Disassembly View,  Next: Display and Define the Metrics,  Prev: The Source Code View,  Up: Getting Started

3.1.4 The Disassembly View
--------------------------

The source view is very useful to obtain more insight into where the
time is spent, but sometimes this is not sufficient.  This is when the
disassembly view comes in.  It is activated with the 'disasm' command
and as with the source view, it displays an annotated listing.  In this
case it shows the instructions with the metrics, interleaved with the
source lines.  The instructions have a reference in square brackets ('['
and ']') to the source line they correspond to.

This is what we get for our example:

     $ gprofng display text -disasm mxv_core test.1.er

     Source file: <apath>/mxv.c
     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

        <lines deleted>
                             32. void __attribute__ ((noinline))
                                 mxv_core (
                                 uint64_t row_index_start,
                                 uint64_t row_index_end,
                                 uint64_t m, uint64_t n,
                                 double **restrict A,
                                 double *restrict b,
                                 double *restrict c)
                             33. {
                                 <Function: mxv_core>
        0.        0.             [33]   4021ba:  mov    0x8(%rsp),%r10
                             34.    for (uint64_t i=row_index_start;
                                         i<=row_index_end; i++) {
        0.        0.             [34]   4021bf:  cmp    %rsi,%rdi
        0.        0.             [34]   4021c2:  jbe    0x37
        0.        0.             [34]   4021c4:  ret
                             35.        double row_sum = 0.0;
                             36.        for (int64_t j=0; j<n; j++)
                             37.           row_sum += A[i][j]*b[j];
        0.        0.             [37]   4021c5:  mov    (%r8,%rdi,8),%rdx
        0.        0.             [36]   4021c9:  mov    $0x0,%eax
        0.        0.             [35]   4021ce:  pxor   %xmm1,%xmm1
        0.002     0.002          [37]   4021d2:  movsd  (%rdx,%rax,8),%xmm0
        0.096     0.096          [37]   4021d7:  mulsd  (%r9,%rax,8),%xmm0
        0.375     0.375          [37]   4021dd:  addsd  %xmm0,%xmm1
     ## 1.683     1.683          [36]   4021e1:  add    $0x1,%rax
        0.004     0.004          [36]   4021e5:  cmp    %rax,%rcx
        0.        0.             [36]   4021e8:  jne    0xffffffffffffffea
                             38.        c[i] = row_sum;
        0.        0.             [38]   4021ea:  movsd  %xmm1,(%r10,%rdi,8)
        0.        0.             [34]   4021f0:  add    $0x1,%rdi
        0.        0.             [34]   4021f4:  cmp    %rdi,%rsi
        0.        0.             [34]   4021f7:  jb     0xd
        0.        0.             [35]   4021f9:  pxor   %xmm1,%xmm1
        0.        0.             [36]   4021fd:  test   %rcx,%rcx
        0.        0.             [36]   402200:  jne    0xffffffffffffffc5
        0.        0.             [36]   402202:  jmp    0xffffffffffffffe8
                             39.    }
                             40. }
        0.        0.             [40]   402204:  ret

   For each instruction, the timing values are given and we can see
exactly which ones are the most expensive.  As with the source level
view, the most expensive instructions are marked with the '##' symbol.

   As illustrated below and similar to the 'lines' command, we can get
an overview of the instructions executed by using the 'pcs' command.

Below are the command and the output, which again has been restricted to
10 lines:

     $ gprofng display text -pcs test.1.er

     PCs sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272  <Total>
     1.683     1.683  mxv_core + 0x00000027, line 36 in "mxv.c"
     0.375     0.375  mxv_core + 0x00000023, line 37 in "mxv.c"
     0.096     0.096  mxv_core + 0x0000001D, line 37 in "mxv.c"
     0.027     0.027  init_data + 0x000000BD, line 72 in "manage_data.c"
     0.012     0.012  init_data + 0x00000117, line 77 in "manage_data.c"
     0.008     0.008  _int_malloc + 0x00000A45
     0.007     0.007  erand48_r + 0x00000062
     0.006     0.006  drand48 + 0x00000000
     0.005     0.005  __drand48_iterate + 0x00000005


File: gprofng.info,  Node: Display and Define the Metrics,  Next: A First Customization of the Output,  Prev: The Disassembly View,  Up: Getting Started

3.1.5 Display and Define the Metrics
------------------------------------

The default metrics shown by 'gprofng display text' are useful, but
there is more recorded than displayed.  We can customize the values
shown by defining the metrics ourselves.

   There are two commands related to changing the metrics shown:
'metric_list' and 'metrics'.

   The first command shows the metrics in use, plus all the metrics that
have been stored as part of the experiment.  The second command may be
used to define the metric list.

   In our example we get the following values for the metrics:

     $ gprofng display text -metric_list test.1.er

     Current metrics: e.totalcpu:i.totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
     Available metrics:
        Exclusive Total CPU Time: e.%totalcpu
        Inclusive Total CPU Time: i.%totalcpu
                            Size: size
                      PC Address: address
                            Name: name

   This shows the metrics currently in use, the metric that is used to
sort the data and all the metrics that have been recorded, but are not
necessarily shown.

   In this case, the default metrics are set to the exclusive and
inclusive total CPU times, plus the name of the function, or load
object.

   The 'metrics' command is used to define the metrics that need to be
displayed.

   For example, to display the exclusive total CPU time, both as a
number and a percentage, use the following metric definition:
'e.%totalcpu'

   Since the metrics can be tailored for different views, there is a way
to reset them to the default.  This is done through the special keyword
'default'.
706
707
708File: gprofng.info,  Node: A First Customization of the Output,  Next: Name the Experiment Directory,  Prev: Display and Define the Metrics,  Up: Getting Started
709
7103.1.6 A First Customization of the Output
711-----------------------------------------
712
With the information just given, we can customize the function overview.
For the sake of the example, we would like to display the name of the
function first, followed by the exclusive CPU time, given as an absolute
number and a percentage.
717
718   Note that the commands are parsed in order of appearance.  This is
719why we need to define the metrics _before_ requesting the function
720overview:
721
722     $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er
723
724     Current metrics: name:e.%totalcpu
725     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
726     Functions sorted by metric: Exclusive Total CPU Time
727
728     Name                Excl. Total
729                         CPU
730                          sec.      %
731      <Total>            2.272 100.00
732      mxv_core           2.160  95.04
733      init_data          0.047   2.06
734      erand48_r          0.030   1.32
735      __drand48_iterate  0.013   0.57
736      drand48            0.013   0.57
737      _int_malloc        0.008   0.35
738      brk                0.001   0.04
739      sysmalloc          0.001   0.04
740      __default_morecore 0.      0.
741      __libc_start_main  0.      0.
742      allocate_data      0.      0.
743      collector_root     0.      0.
744      driver_mxv         0.      0.
745      main               0.      0.
746      malloc             0.      0.
747      sbrk               0.      0.
748
   This was a first and simple example of how to customize the output.
Note that we did not rerun our profiling job; we merely modified the
display settings.  Below, we show other, more advanced examples of
customization.
753
754
755File: gprofng.info,  Node: Name the Experiment Directory,  Next: Control the Number of Lines in the Output,  Prev: A First Customization of the Output,  Up: Getting Started
756
7573.1.7 Name the Experiment Directory
758-----------------------------------
759
When using 'gprofng collect app', the default names for experiments work
fine, but they are quite generic.  It is often more convenient to select
a more descriptive name, for example, one that reflects the conditions
of the experiment.
764
   For this, the mutually exclusive '-o' and '-O' options come in handy.
Both may be used to provide a name for the experiment directory, but
their behaviour differs when the directory already exists.
768
769   With the '-o' option, an existing experiment directory is not
770overwritten.  You either need to explicitly remove an existing directory
771first, or use a name that is not in use yet.
772
773   This is in contrast with the behaviour for the '-O' option.  Any
774existing (experiment) directory with the same name is silently
775overwritten.
776
777   Be aware that the name of the experiment directory has to end with
778'.er'.
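   The suffix requirement can be checked mechanically before starting a
collection.  The sketch below is illustrative only; the variable names
and the helper logic are our own and not part of 'gprofng':

```shell
# Illustrative only: build a descriptive experiment directory name
# and verify the mandatory ".er" suffix before using it with -o/-O.
algo=mxv
threads=4
expdir="${algo}.${threads}.thr.er"
case "$expdir" in
  *.er) echo "valid name: $expdir" ;;
  *)    echo "error: experiment directory name must end in .er" >&2 ;;
esac
```

Such a check is convenient in job scripts, where the name is often
composed from several runtime parameters.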
779
780
781File: gprofng.info,  Node: Control the Number of Lines in the Output,  Next: Sorting the Performance Data,  Prev: Name the Experiment Directory,  Up: Getting Started
782
7833.1.8 Control the Number of Lines in the Output
784-----------------------------------------------
785
The 'limit <n>' command controls the number of lines printed in various
overviews, including the function view.  It also takes effect for other
display commands, such as 'lines'.
789
   The argument '<n>' must be a positive integer.  It sets the number
of lines in the function view.  A value of zero resets the limit to the
default.
793
   Be aware that the pseudo-function '<Total>' counts as a regular
function.  For example, 'limit 10' displays nine user-level functions.
796
797
798File: gprofng.info,  Node: Sorting the Performance Data,  Next: Scripting,  Prev: Control the Number of Lines in the Output,  Up: Getting Started
799
8003.1.9 Sorting the Performance Data
801----------------------------------
802
803The 'sort <key>' command sets the key to be used when sorting the
804performance data.
805
806   The key is a valid metric definition, but the visibility field (*Note
807Metric Definitions::) in the metric definition is ignored since this
808does not affect the outcome of the sorting operation.  For example if we
809set the sort key to 'e.totalcpu', the values will be sorted in
810descending order with respect to the exclusive total CPU time.
811
812   The data can be sorted in reverse order by prepending the metric
813definition with a minus ('-') sign.  For example 'sort -e.totalcpu'.
814
   A default metric for the sort operation has been defined.  Since
'sort' is a persistent command, this default can be restored by using
'default' as the key.
818
819
820File: gprofng.info,  Node: Scripting,  Next: A More Elaborate Example,  Prev: Sorting the Performance Data,  Up: Getting Started
821
8223.1.10 Scripting
823----------------
824
As is probably clear by now, the list of commands for 'gprofng display
text' can get very long.  Typing them is tedious and error prone.
Luckily, there is an easier and more elegant way to control the
behaviour of this tool.
829
830   Through the 'script' command, the name of a file with commands can be
831passed in.  These commands are parsed and executed as if they appeared
832on the command line in the same order as encountered in the file.  The
833commands in this script file can actually be mixed with commands on the
834command line.
835
836   The difference between the commands in the script file and those used
837on the command line is that the latter require a leading dash ('-')
838symbol.
839
840   Comment lines are supported.  They need to start with the '#' symbol.
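   To illustrate, a script file is just a plain text file and can, for
example, be created with a heredoc.  The file name 'my-script' and the
commands below are merely an example:

```shell
# Create a gprofng script file.  Comment lines start with "#";
# commands appear without the leading dash used on the command line.
cat > my-script << 'EOF'
# Set the metrics
metrics e.%totalcpu:name
# Limit the function list to 5 lines
limit 5
functions
EOF
# Count the comment lines in the generated file.
grep -c '^#' my-script   # prints 2
```

The file would then be passed to 'gprofng display text' through the
'-script my-script' option.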
841
842
843File: gprofng.info,  Node: A More Elaborate Example,  Next: The Call Tree,  Prev: Scripting,  Up: Getting Started
844
8453.1.11 A More Elaborate Example
846-------------------------------
847
848With the information presented so far, we can customize our data
849gathering and display commands.
850
851   As an example, to reflect the name of the algorithm and the number of
852threads that were used in the experiment, we select 'mxv.1.thr.er' as
853the name of the experiment directory.  All we then need to do is to add
854the '-O' option followed by this name on the command line when running
855'gprofng collect app':
856
857     $ exe=mxv-pthreads.exe
858     $ m=3000
859     $ n=2000
860     $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1
861
862   The commands to generate the profile are put into a file that we
863simply call 'my-script':
864
865     $ cat my-script
866     # This is my first gprofng script
867     # Set the metrics
868     metrics i.%totalcpu:e.%totalcpu:name
869     # Use the exclusive time to sort
870     sort e.totalcpu
871     # Limit the function list to 5 lines
872     limit 5
873     # Show the function list
874     functions
875
876   This script file is then specified as input to the 'gprofng display
877text' command that is used to display the performance information stored
878in 'mxv.1.thr.er':
879
880     $ gprofng display text -script my-script mxv.1.thr.er
881
882   The command above produces the following output:
883
884     # This is my first gprofng script
885     # Set the metrics
886     Current metrics: i.%totalcpu:e.%totalcpu:name
887     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
888     # Use the exclusive time to sort
889     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
890     # Limit the function list to 5 lines
891     Print limit set to 5
892     # Show the function list
893     Functions sorted by metric: Exclusive Total CPU Time
894
895     Incl. Total   Excl. Total    Name
896     CPU           CPU
897      sec.      %   sec.      %
898     2.272 100.00  2.272 100.00   <Total>
899     2.159  95.00  2.159  95.00   mxv_core
900     0.102   4.48  0.054   2.37   init_data
901     0.035   1.54  0.025   1.10   erand48_r
902     0.048   2.11  0.013   0.57   drand48
903
904   In the first part of the output, our comment lines in the script file
905are shown.  These are interleaved with an acknowledgement message for
906the commands.
907
908   This is followed by a profile consisting of 5 lines only.  For both
909metrics, the percentages plus the timings are given.  The numbers are
910sorted with respect to the exclusive total CPU time.
911
   It is now immediately clear that function 'mxv_core' is responsible
for 95% of the CPU time, while 'init_data' takes only 4.5%.
914
   This is also where we see sampling in action.  Although this is
exactly the same job we profiled before, the timings are slightly
different; the differences are, however, very small.
918
919
920File: gprofng.info,  Node: The Call Tree,  Next: More Information on the Experiment,  Prev: A More Elaborate Example,  Up: Getting Started
921
9223.1.12 The Call Tree
923--------------------
924
The call tree shows the dynamic hierarchy of the application by
displaying the functions executed and their parents.  It helps to find
the most expensive execution path in the program.
928
929   This feature is enabled through the 'calltree' command.  This is how
930to get this tree for our current experiment:
931
932     $ gprofng display text -calltree mxv.1.thr.er
933
934   This displays the following structure:
935
936     Functions Call Tree. Metric: Attributed Total CPU Time
937
938     Attr.      Name
939     Total
940     CPU sec.
941     2.272      +-<Total>
942     2.159        +-collector_root
943     2.159        |  +-driver_mxv
944     2.159        |    +-mxv_core
945     0.114        +-__libc_start_main
946     0.114          +-main
947     0.102            +-init_data
948     0.048            |  +-drand48
949     0.035            |    +-erand48_r
950     0.010            |      +-__drand48_iterate
951     0.011            +-allocate_data
952     0.011            |  +-malloc
953     0.011            |    +-_int_malloc
954     0.001            |      +-sysmalloc
955     0.001            +-check_results
956     0.001              +-malloc
957     0.001                +-_int_malloc
958
   At first sight, this may not be what you expected, so some
explanation is in order.
961
   First of all, function 'collector_root' is internal to 'gprofng' and
should be hidden from the user.  This is part of a planned future
enhancement.
965
966   Recall that the 'objects' and 'fsingle' commands are very useful to
967find out more about load objects in general, but also to help identify
968an unknown entry in the function overview.  *Note Load Objects and
969Functions::.
970
   Another thing to note is that there are two main branches: one under
'collector_root' and a second one under '__libc_start_main'.  This
reflects the fact that we are executing a parallel program.  Even
though we only used one thread for this run, it is still executed in a
separate path.
976
977   The main, sequential part of the program is displayed under 'main'
978and shows the functions called and the time they took.
979
980   There are two things worth noting for the call tree feature:
981
   * This is a dynamic tree and, since sampling is used, it most likely
     looks slightly different across seemingly identical profile runs.
     In case the run times are short, it is worth considering using a
     higher sampling resolution through the '-p' option, for example,
     '-p hi'.
987
988   * In case hardware event counters have been enabled (*Note Profile
989     Hardware Event Counters::), these values are also displayed in the
990     call tree view.
991
992
993File: gprofng.info,  Node: More Information on the Experiment,  Next: Control the Sampling Frequency,  Prev: The Call Tree,  Up: Getting Started
994
9953.1.13 More Information on the Experiment
996-----------------------------------------
997
The experiment directory contains more than just performance related
data.  Several system characteristics, the actual command executed, and
some global performance statistics can be displayed as well.
1001
   The 'header' command displays information about the experiment(s).
For example, this is the command to extract this data from our
experiment directory:
1005
1006     $ gprofng display text -header mxv.1.thr.er
1007
   The above command prints the following information.  Note that some
of the layout and the information has been modified.  The textual
changes are marked with the '<' and '>' symbols.
1011
1012     Experiment: mxv.1.thr.er
1013     No errors
1014     No warnings
1015     Archive command `gp-archive -n -a on
1016              --outfile <exp_dir>/archive.log <exp_dir>'
1017
1018     Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
1019     Process pid 30591, ppid 30589, pgrp 30551, sid 30468
1020     Current working directory: <cwd>
1021     Collector version: `2.36.50'; experiment version 12.4 (64-bit)
1022     Host `<hostname>', OS `Linux <version>', page size 4096,
1023          architecture `x86_64'
1024       16 CPUs, clock speed 1995 MHz.
1025       Memory: 30871514 pages @  4096 = 120591 MB.
1026     Data collection parameters:
1027       Clock-profiling, interval = 997 microsecs.
1028       Periodic sampling, 1 secs.
1029       Follow descendant processes from: fork|exec|combo
1030
1031     Experiment started <date and time>
1032
1033     Experiment Ended: 2.293162658
1034     Data Collection Duration: 2.293162658
1035
   The output above may assist in troubleshooting, or in verifying some
of the operational conditions, and we recommend including this command
when generating a profile.
1039
   Related to this command, there is a useful option to record your own
comment(s) in an experiment.  To this end, use the '-C' option on the
'gprofng collect app' tool to specify a comment string.  Up to ten
comment lines can be included.  These comments are displayed with the
'header' command of the 'gprofng display text' tool.
1045
   The 'overview' command displays information on the experiment(s) and
also shows a summary of the values for the metric(s) used.  This is an
example of how to use it on our newly created experiment directory:
1049
1050     $ gprofng display text -overview mxv.1.thr.er
1051
1052     Experiment(s):
1053
1054     Experiment      :mxv.1.thr.er
1055       Target        : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
1056       Host          : <hostname> (<ISA>, Linux <version>)
1057       Start Time    : <date and time>
1058       Duration      : 2.293 Seconds
1059
1060     Metrics:
1061
1062       Experiment Duration (Seconds): [2.293]
1063       Clock Profiling
1064         [X]Total CPU Time - totalcpu (Seconds): [*2.272]
1065
1066     Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
1067            metrics.
1068            The metrics command can be used to change selections. The
1069            metric_list command lists all available metrics.
1070
   This command provides a dashboard overview that helps to quickly
identify where the time is spent.  In case hardware event counters are
used, it also shows their total values.
1074
1075
1076File: gprofng.info,  Node: Control the Sampling Frequency,  Next: Information on Load Objects,  Prev: More Information on the Experiment,  Up: Getting Started
1077
10783.1.14 Control the Sampling Frequency
1079-------------------------------------
1080
So far, we have not discussed the frequency of the sampling process,
but in some cases it is useful to change the default of 10 milliseconds.
1083
1084   The advantage of increasing the sampling frequency is that functions
1085that do not take much time per invocation are more accurately captured.
1086The downside is that more data is gathered.  This has an impact on the
1087overhead of the collection process and more disk space is required.
1088
1089   In general this is not an immediate concern, but with heavily
1090threaded applications that run for an extended period of time,
1091increasing the frequency may have a more noticeable impact.
1092
1093   The '-p' option on the 'gprofng collect app' tool is used to enable
1094or disable clock based profiling, or to explicitly set the sampling
1095rate.  This option takes one of the following keywords:
1096
1097'off'
1098     Disable clock based profiling.
1099
1100'on'
1101     Enable clock based profiling with a per thread sampling interval of
1102     10 ms.  This is the default.
1103
1104'lo'
1105     Enable clock based profiling with a per thread sampling interval of
1106     100 ms.
1107
1108'hi'
1109     Enable clock based profiling with a per thread sampling interval of
1110     1 ms.
1111
1112'<value>'
1113     Enable clock based profiling with a per thread sampling interval of
1114     <value>.
1115
1116   One may wonder why there is an option to disable clock based
1117profiling.  This is because by default, it is enabled when conducting
1118hardware event counter experiments (*Note Profile Hardware Event
1119Counters::).  With the '-p off' option, this can be disabled.
1120
   If an explicit value is set for the sampling interval, the number
can be an integer or a floating-point number.  A suffix of 'u' for
microseconds, or 'm' for milliseconds, is supported.  If no suffix is
used, the value is assumed to be in milliseconds.
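   As an illustration of this convention, the sketch below converts a
value into microseconds.  This is our own example of the documented
unit handling and not the actual parser used by 'gprofng':

```shell
# Illustrative only: map a sampling interval value to microseconds,
# following the documented convention ("u" = microseconds,
# "m" = milliseconds, no suffix = milliseconds).
to_usec() {
  case "$1" in
    *u) printf '%s\n' "${1%u}" ;;
    *m) awk -v v="${1%m}" 'BEGIN { printf "%d\n", v * 1000 }' ;;
    *)  awk -v v="$1"     'BEGIN { printf "%d\n", v * 1000 }' ;;
  esac
}
to_usec 500u   # prints 500
to_usec 0.5m   # prints 500
to_usec 2      # prints 2000
```

Note that the value '0.5m' illustrates that floating-point numbers are
accepted with a suffix as well.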
1125
1126   If the value is smaller than the clock profiling minimum, a warning
1127message is issued and it is set to the minimum.  In case it is not a
1128multiple of the clock profiling resolution, it is silently rounded down
1129to the nearest multiple of the clock resolution.
1130
1131   If the value exceeds the clock profiling maximum, is negative, or
1132zero, an error is reported.
1133
1134   Note that the 'header' command echoes the sampling rate used.
1135
1136
1137File: gprofng.info,  Node: Information on Load Objects,  Prev: Control the Sampling Frequency,  Up: Getting Started
1138
11393.1.15 Information on Load Objects
1140----------------------------------
1141
It may happen that the function list contains a function that is not
known to the user.  This can easily happen with library functions, for
example.  Luckily, there are three commands that come in handy in such
cases.
1145
1146   These commands are 'objects', 'fsingle', and 'fsummary'.  They
1147provide details on load objects (*Note Load Objects and Functions::).
1148
1149   The 'objects' command lists all load objects that have been
1150referenced during the performance experiment.  Below we show the command
1151and the result for our profile job.  Like before, the (long) path names
1152in the output have been shortened and replaced by the '<apath>' symbol
1153that represents an absolute directory path.
1154
1155     $ gprofng display text -objects mxv.1.thr.er
1156
1157   The output includes the name and path of the target executable:
1158
1159      <Unknown> (<Unknown>)
1160      <mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
1161      <librt-2.17.so> (/usr/lib64/librt-2.17.so)
1162      <libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
1163      <libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
1164      <libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
1165      <libc-2.17.so> (/usr/lib64/libc-2.17.so)
1166      <libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
1167      <libm-2.17.so> (/usr/lib64/libm-2.17.so)
1168      <libgp-collector.so> (<apath>/libgp-collector.so)
1169      <ld-2.17.so> (/usr/lib64/ld-2.17.so)
1170      <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)
1171
   The 'fsingle' command may be used to get more details on a specific
entry in the function view.  For example, the command below provides
additional information on the 'collector_root' function shown in the
function overview.
1176
1177     $ gprofng display text -fsingle collector_root mxv.1.thr.er
1178
   Below is the output from this command.  It has been somewhat
modified to match the display requirements.
1181
1182     collector_root
1183       Exclusive Total CPU Time: 0.    (  0. %)
1184       Inclusive Total CPU Time: 2.159 ( 95.0%)
1185                 Size:   401
1186           PC Address: 10:0x0001db60
1187          Source File: <apath>/dispatcher.c
1188          Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
1189          Load Object: <apath>/libgp-collector.so
1190         Mangled Name:
1191              Aliases:
1192
1193   In this table we not only see how much time was spent in this
1194function, we also see where it originates from.  In addition to this,
1195the size and start address are given as well.  If the source code
1196location is known it is also shown here.
1197
1198   The related 'fsummary' command displays the same information as
1199'fsingle', but for all functions in the function overview, including
1200'<Total>':
1201
1202     $ gprofng display text -fsummary mxv.1.thr.er
1203
1204     Functions sorted by metric: Exclusive Total CPU Time
1205
1206     <Total>
1207       Exclusive Total CPU Time: 2.272 (100.0%)
1208       Inclusive Total CPU Time: 2.272 (100.0%)
1209                 Size:     0
1210           PC Address: 1:0x00000000
1211          Source File: (unknown)
1212          Object File: (unknown)
1213          Load Object: <Total>
1214         Mangled Name:
1215              Aliases:
1216
1217     mxv_core
1218       Exclusive Total CPU Time: 2.159 ( 95.0%)
1219       Inclusive Total CPU Time: 2.159 ( 95.0%)
1220                 Size:    75
1221           PC Address: 2:0x000021ba
1222          Source File: <apath>/mxv.c
1223          Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
1224          Load Object: <apath>/mxv-pthreads.exe
1225         Mangled Name:
1226              Aliases:
1227
1228               ... etc ...
1229
1230
1231File: gprofng.info,  Node: Support for Multithreading,  Next: Viewing Multiple Experiments,  Prev: Getting Started,  Up: A Mini Tutorial
1232
12333.2 Support for Multithreading
1234==============================
1235
1236In this chapter we introduce and discuss the support for multithreading.
1237As is shown below, nothing needs to be changed when collecting the
1238performance data.
1239
   The difference is that additional commands are available to get more
information on the parallel environment, and several filters allow the
user to zoom in on specific threads.
1243
1244* Menu:
1245
1246* Creating a Multithreading Experiment::
1247* Commands Specific to Multithreading::
1248
1249
1250File: gprofng.info,  Node: Creating a Multithreading Experiment,  Next: Commands Specific to Multithreading,  Up: Support for Multithreading
1251
12523.2.1 Creating a Multithreading Experiment
1253------------------------------------------
1254
1255We demonstrate the support for multithreading using the same code and
1256settings as before, but this time we use 2 threads:
1257
1258     $ exe=mxv-pthreads.exe
1259     $ m=3000
1260     $ n=2000
1261     $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2
1262
1263   First of all, note that we did not change anything, other than
1264setting the number of threads to 2.  Nothing special is needed to
1265profile a multithreaded job when using 'gprofng'.
1266
1267   The same is true when displaying the performance results.  The same
1268commands that we used before work unmodified.  For example, this is all
1269that is needed to get a function overview:
1270
     $ gprofng display text -limit 10 -functions mxv.2.thr.er
1272
1273   This produces the following familiar looking output:
1274
1275     Print limit set to 10
1276     Functions sorted by metric: Exclusive Total CPU Time
1277
1278     Excl.     Incl.      Name
1279     Total     Total
1280     CPU sec.  CPU sec.
1281     2.268     2.268      <Total>
1282     2.155     2.155      mxv_core
1283     0.044     0.103      init_data
1284     0.030     0.046      erand48_r
1285     0.016     0.016      __drand48_iterate
1286     0.013     0.059      drand48
1287     0.008     0.011      _int_malloc
1288     0.003     0.003      brk
1289     0.        0.003      __default_morecore
1290     0.        0.114      __libc_start_main
1291
1292
1293File: gprofng.info,  Node: Commands Specific to Multithreading,  Prev: Creating a Multithreading Experiment,  Up: Support for Multithreading
1294
12953.2.2 Commands Specific to Multithreading
1296-----------------------------------------
1297
The function overview above shows the results aggregated over all the
threads.  The interesting new element is that we can also look at the
performance data for the individual threads.
1301
1302   The 'thread_list' command displays how many threads have been used:
1303
1304     $ gprofng display text -thread_list mxv.2.thr.er
1305
1306   This produces the following output, showing that three threads have
1307been used:
1308
1309     Exp Sel Total
1310     === === =====
1311       1 all     3
1312
1313   The output confirms there is one experiment and that by default all
1314threads are selected.
1315
   It may seem surprising to see three threads here, since we used the
'-t 2' option, but it is common for a Pthreads program to use one
additional thread.  This is typically the thread that runs from start
to finish; it handles the sequential portions of the code and also
takes care of managing the other threads.
1321
1322   It is no different in our example code.  At some point, the main
1323thread creates and activates the two threads that perform the
1324multiplication of the matrix with the vector.  Upon completion of this
1325computation, the main thread continues.
1326
1327   The 'threads' command is simple, yet very powerful.  It shows the
1328total value of the metrics for each thread.  To make it easier to
1329interpret the data, we modify the metrics to include percentages:
1330
1331     $ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er
1332
1333   The command above produces the following overview:
1334
1335     Current metrics: e.%totalcpu:name
1336     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1337     Objects sorted by metric: Exclusive Total CPU Time
1338
1339     Excl. Total    Name
1340     CPU
1341      sec.      %
1342     2.258 100.00   <Total>
1343     1.075  47.59   Process 1, Thread 3
1344     1.070  47.37   Process 1, Thread 2
1345     0.114   5.03   Process 1, Thread 1
1346
1347   The first line gives the total CPU time accumulated over the threads
1348selected.  This is followed by the metric value(s) for each thread.
1349
1350   From this it is clear that the main thread is responsible for 5% of
1351the total CPU time, while the other two threads take 47% each.
1352
   This view is ideally suited to verify whether there are any load
balancing issues, and also to find the most time consuming thread(s).
1355
   While useful, often more information than this is needed.  This is
where the thread selection filter comes in.  Through the
'thread_select' command, one or more threads may be selected (*Note The
Selection List::, for how to define the selection list).
1360
   Since it is most common to use this command in a script, we do so
here as well.  Below is the script we are using:
1363
1364     # Define the metrics
1365     metrics e.%totalcpu
1366     # Limit the output to 10 lines
1367     limit 10
1368     # Get the function overview for thread 1
1369     thread_select 1
1370     functions
1371     # Get the function overview for thread 2
1372     thread_select 2
1373     functions
1374     # Get the function overview for thread 3
1375     thread_select 3
1376     functions
1377
   The definition of the metrics and the output limiter has been shown
and explained before, so we do not repeat it here.  The new command we
focus on is 'thread_select'.
1381
1382   This command takes a list (*Note The Selection List::) to select
1383specific threads.  In this case we simply use the individual thread
1384numbers that we obtained with the 'thread_list' command earlier.
1385
1386   This restricts the output of the 'functions' command to the thread
1387number(s) specified.  This means that the script above shows which
1388function(s) each thread executes and how much CPU time they consumed.
1389Both the timings and their percentages are given.
1390
1391   This is the relevant part of the output for the first thread:
1392
1393     # Get the function overview for thread 1
1394     Exp Sel Total
1395     === === =====
1396       1 1       3
1397     Functions sorted by metric: Exclusive Total CPU Time
1398
1399     Excl. Total    Name
1400     CPU
1401      sec.      %
1402     0.114 100.00   <Total>
1403     0.051  44.74   init_data
1404     0.028  24.56   erand48_r
1405     0.017  14.91   __drand48_iterate
1406     0.010   8.77   _int_malloc
1407     0.008   7.02   drand48
1408     0.      0.     __libc_start_main
1409     0.      0.     allocate_data
1410     0.      0.     main
1411     0.      0.     malloc
1412
1413   As usual, the comment lines are echoed.  This is followed by a
1414confirmation of our selection.  We see that indeed thread 1 has been
1415selected.  What is displayed next is the function overview for this
1416particular thread.  Due to the 'limit 10' command, there are ten entries
1417in this list.
1418
1419   Below are the overviews for threads 2 and 3 respectively.  We see
1420that all of the CPU time is spent in function 'mxv_core' and that this
1421time is approximately the same for both threads.
1422
1423     # Get the function overview for thread 2
1424     Exp Sel Total
1425     === === =====
1426       1 2       3
1427     Functions sorted by metric: Exclusive Total CPU Time
1428
1429     Excl. Total    Name
1430     CPU
1431      sec.      %
1432     1.072 100.00   <Total>
1433     1.072 100.00   mxv_core
1434     0.      0.     collector_root
1435     0.      0.     driver_mxv
1436
1437     # Get the function overview for thread 3
1438     Exp Sel Total
1439     === === =====
1440       1 3       3
1441     Functions sorted by metric: Exclusive Total CPU Time
1442
1443     Excl. Total    Name
1444     CPU
1445      sec.      %
1446     1.076 100.00   <Total>
1447     1.076 100.00   mxv_core
1448     0.      0.     collector_root
1449     0.      0.     driver_mxv
1450
   When analyzing the performance of a multithreaded application, it is
sometimes useful to know whether threads have mostly executed on the
same core, or whether they have wandered across multiple cores.  This
sort of stickiness is usually referred to as _thread affinity_.
1455
1456   Similar to the commands for the threads, there are several commands
1457related to the usage of the cores, or _CPUs_ as they are called in
1458'gprofng' (*Note The Concept of a CPU in gprofng::).
1459
1460   In order to have some more interesting data to look at, we created a
1461new experiment, this time using 8 threads:
1462
1463     $ exe=mxv-pthreads.exe
1464     $ m=3000
1465     $ n=2000
1466     $ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8
1467
   Similar to the 'thread_list' command, the 'cpu_list' command
displays how many CPUs have been used.  The equivalent of the 'threads'
command is the 'cpus' command, which shows the CPU numbers that were
used and how much time was spent on each of them.  Both are
demonstrated below.
1473
1474     $ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er
1475
1476   This command produces the following output:
1477
1478     Current metrics: e.%totalcpu:name
1479     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1480     Exp Sel Total
1481     === === =====
1482       1 all    10
1483     Objects sorted by metric: Exclusive Total CPU Time
1484
1485     Excl. Total    Name
1486     CPU
1487      sec.      %
1488     2.310 100.00   <Total>
1489     0.286  12.39   CPU 7
1490     0.284  12.30   CPU 13
1491     0.282  12.21   CPU 5
1492     0.280  12.13   CPU 14
1493     0.266  11.52   CPU 9
1494     0.265  11.48   CPU 2
1495     0.264  11.44   CPU 11
1496     0.194   8.42   CPU 0
1497     0.114   4.92   CPU 1
1498     0.074   3.19   CPU 15
1499
1500   What we see in this table is that a total of 10 CPUs have been used.
1501This is followed by a list with all the CPU numbers that have been used
1502during the run.  For each CPU it is shown how much time was spent on it.
1503
1504   While the table with thread times shown earlier may point at a load
1505imbalance in the application, this overview has a different purpose.
1506
1507   For example, we see that 10 CPUs have been used, but we know that the
1508application uses 9 threads only.  This means that at least one thread
1509has executed on more than one CPU. In itself this is not something to
1510worry about, but warrants a deeper investigation.
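
   The pigeonhole reasoning above is easy to automate when
post-processing profile data.  Below is a small Python sketch, using
made-up (thread, CPU) pairs rather than real 'gprofng' output, that
flags the threads observed on more than one CPU:

```python
# Hypothetical (thread, cpu) samples; real data would come from gprofng.
samples = [(2, 5), (3, 7), (7, 0), (7, 15), (4, 13)]

# Collect the set of CPUs each thread was seen on.
cpus_per_thread = {}
for thread, cpu in samples:
    cpus_per_thread.setdefault(thread, set()).add(cpu)

# A thread seen on more than one CPU has migrated at least once.
migrated = sorted(t for t, cpus in cpus_per_thread.items() if len(cpus) > 1)
print(migrated)  # [7]
```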

   Honesty dictates that next we performed a pre-analysis to find out
which thread(s) have been running on more than one CPU. We found this to
be thread 7.  It has executed on CPUs 0 and 15.

   With this knowledge, we wrote the script shown below.  It zooms in on
the behaviour of thread 7.

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     functions
     # Get the function overview for CPU 0
     cpu_select 0
     functions
     # Get the function overview for CPU 15
     cpu_select 15
     functions

   From the threads overview shown earlier, we know that thread 7 has
used '0.268' seconds of CPU time.

   By selecting CPUs 0 and 15, respectively, we get the following
function overviews:

     # Get the function overview for CPU 0
     Exp Sel Total
     === === =====
       1 0      10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.194 100.00   <Total>
     0.194 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for CPU 15
     Exp Sel Total
     === === =====
       1 15     10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.074 100.00   <Total>
     0.074 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   This shows that thread 7 spent '0.194' seconds on CPU 0 and '0.074'
seconds on CPU 15.
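
   As a quick sanity check, these two per-CPU times should add up to the
'0.268' seconds of CPU time reported earlier for thread 7.  A one-line
Python check confirms this:

```python
# Per-CPU times for thread 7, taken from the two overviews above (seconds)
time_cpu0 = 0.194
time_cpu15 = 0.074

# The sum should match the 0.268 seconds reported for thread 7 earlier
print(f"{time_cpu0 + time_cpu15:.3f}")  # 0.268
```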


File: gprofng.info,  Node: Viewing Multiple Experiments,  Next: Profile Hardware Event Counters,  Prev: Support for Multithreading,  Up: A Mini Tutorial

3.3 Viewing Multiple Experiments
================================

One thing we did not cover so far is that 'gprofng' fully supports the
analysis of multiple experiments.  The 'gprofng display text' tool
accepts a list of experiments.  The data can either be aggregated across
the experiments, or used in a comparison.

   The 'experiment_list' command, demonstrated in the next section, may
be used to see which experiments have been loaded.

* Menu:

* Aggregation of Experiments::
* Comparison of Experiments::


File: gprofng.info,  Node: Aggregation of Experiments,  Next: Comparison of Experiments,  Up: Viewing Multiple Experiments

3.3.1 Aggregation of Experiments
--------------------------------

By default, the data for multiple experiments is aggregated and the
display commands show these combined results.

   For example, we can aggregate the data for our single and dual thread
experiments.  Below is the script we used for this:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the list with experiments
     experiment_list
     # Get the function overview
     functions

   With the exception of the 'experiment_list' command, all commands
used have been discussed earlier.

   The 'experiment_list' command provides a list of the experiments that
have been loaded.  This is used to verify we are looking at the
experiments we intend to aggregate.

     $ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er

   With the command above, we get the following output:

     # Define the metrics
     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the output to 10 lines
     Print limit set to 10
     # Get the list with experiments
     ID Sel   PID Experiment
     == === ===== ============
      1 yes 30591 mxv.1.thr.er
      2 yes 11629 mxv.2.thr.er
     # Get the function overview
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     4.533 100.00   <Total>
     4.306  94.99   mxv_core
     0.105   2.31   init_data
     0.053   1.17   erand48_r
     0.027   0.59   __drand48_iterate
     0.021   0.46   _int_malloc
     0.021   0.46   drand48
     0.001   0.02   sysmalloc
     0.      0.     __libc_start_main
     0.      0.     allocate_data

   The first five lines should look familiar.  The five lines that
follow echo the comment lines in the script and show the overview of the
experiments.  This confirms that two experiments have been loaded and
that both are active.

   This is followed by the function overview.  The timings have been
summed up and the percentages are adjusted accordingly.  For example,
the total accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.
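
   The aggregation is a plain sum over the experiments, with the
percentages recomputed against the combined total.  This can be verified
with a few lines of Python, using the numbers from the tables:

```python
# Total CPU times of the two experiments (seconds)
totals = [2.272, 2.261]
combined = sum(totals)
print(f"{combined:.3f}")  # 4.533

# Recomputed percentage for mxv_core in the aggregated view
mxv_core = 4.306
print(f"{100 * mxv_core / combined:.2f}")  # 94.99
```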


File: gprofng.info,  Node: Comparison of Experiments,  Prev: Aggregation of Experiments,  Up: Viewing Multiple Experiments

3.3.2 Comparison of Experiments
-------------------------------

The support for multiple experiments really shines in comparison mode.
This feature is enabled through the command 'compare on' and is disabled
by setting 'compare off'.

   In comparison mode, the data for the various experiments is shown
side by side, as illustrated below, where we compare the results for the
multithreaded experiments using one and two threads respectively:

     $ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er

This produces the following output:

     Functions sorted by metric: Exclusive Total CPU Time

     mxv.1.thr.er  mxv.2.thr.er  mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total   Incl. Total   Incl. Total    Name
     CPU           CPU           CPU           CPU
      sec.          sec.          sec.          sec.
     2.272         2.261         2.272         2.261          <Total>
     2.159         2.148         2.159         2.148          mxv_core
     0.054         0.051         0.102         0.104          init_data
     0.025         0.028         0.035         0.045          erand48_r
     0.013         0.008         0.048         0.053          drand48
     0.011         0.010         0.012         0.010          _int_malloc
     0.010         0.017         0.010         0.017          __drand48_iterate
     0.001         0.            0.001         0.             sysmalloc
     0.            0.            0.114         0.114          __libc_start_main
     0.            0.            0.011         0.010          allocate_data
     0.            0.            0.001         0.             check_results
     0.            0.            2.159         2.148          collector_root
     0.            0.            2.159         2.148          driver_mxv
     0.            0.            0.114         0.114          main
     0.            0.            0.012         0.010          malloc

   This table already makes it easier to compare (two) profiles, but
there is more that we can do here.

   By default, in comparison mode, all measured values are shown.  Often
profiling is about comparing performance data.  It is therefore more
useful to look at differences, or ratios, using one experiment as a
reference.

   The values shown are then relative to this reference.  For example,
if a ratio is below one, it means the reference value was higher.

   This feature is controlled through the 'compare' command.  In
addition to 'on' and 'off', this command also supports 'delta' and
'ratio'.

   Using one of these two keywords enables the comparison feature and
shows either the difference, or the ratio, relative to the reference
data.
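
   In other words, with experiment 1 as the reference, 'compare delta'
shows value minus reference and 'compare ratio' shows value divided by
reference.  The sketch below mimics this for the 'mxv_core' timings from
the comparison table above:

```python
# How 'compare delta' and 'compare ratio' derive their columns,
# with the first experiment acting as the reference.
def delta(value, reference):
    return value - reference

def ratio(value, reference):
    return value / reference

# mxv_core exclusive CPU time: one thread (reference) vs. two threads
ref, val = 2.159, 2.148
print(f"{delta(val, ref):+.3f}")  # -0.011
print(f"{ratio(val, ref):.3f}")   # 0.995 (below one: the reference was higher)
```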

   In the example below, we use the same two experiments used in the
comparison above, but as before, the number of lines is restricted to 10
and we focus on the exclusive timings plus percentages.  For the
comparison part, we are interested in the differences.

   This is the script that produces such an overview:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Set the comparison mode to differences
     compare delta
     # Get the function overview
     functions

   Assuming this script file is called 'my-script-comp', this is how we
get the table displayed on our screen:

     $ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er

   Leaving out some of the lines we have seen before, we get the
following table:

     mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total     Name
     CPU           CPU
      sec.      %   delta      %
     2.272 100.00  -0.011 100.00   <Total>
     2.159  95.00  -0.011  94.97   mxv_core
     0.054   2.37  -0.003   2.25   init_data
     0.025   1.10  +0.003   1.23   erand48_r
     0.013   0.57  -0.005   0.35   drand48
     0.011   0.48  -0.001   0.44   _int_malloc
     0.010   0.44  +0.007   0.75   __drand48_iterate
     0.001   0.04  -0.001   0.     sysmalloc
     0.      0.    +0.      0.     __libc_start_main
     0.      0.    +0.      0.     allocate_data

   It is now easy to see that the CPU times for the most time consuming
functions in this code are practically the same.

   While in this case we used the delta as a comparison, the ratio could
have been used equally well.

   Note that the comparison feature is supported at the function,
source, and disassembly level.  There is no practical limit on the
number of experiments that can be used in a comparison.


File: gprofng.info,  Node: Profile Hardware Event Counters,  Next: Java Profiling,  Prev: Viewing Multiple Experiments,  Up: A Mini Tutorial

3.4 Profile Hardware Event Counters
===================================

Many processors provide a set of hardware event counters and 'gprofng'
provides support for this feature.  *Note Hardware Event Counters
Explained:: for those readers who are not familiar with such counters
and would like to learn more.

   In this section we explain how to get the details on the event
counter support for the processor used in the experiment(s), and show
several examples.

* Menu:

* Getting Information on the Counters Supported::
* Examples Using Hardware Event Counters::


File: gprofng.info,  Node: Getting Information on the Counters Supported,  Next: Examples Using Hardware Event Counters,  Up: Profile Hardware Event Counters

3.4.1 Getting Information on the Counters Supported
---------------------------------------------------

The first step is to check whether the processor used for the
experiments is supported by 'gprofng'.

   The '-h' option on 'gprofng collect app' shows the event counter
information:

     $ gprofng collect app -h

   If the counters are supported, a list with the events is printed.
Otherwise, a warning message is issued.

   For example, below we show this command and the output on an Intel
Xeon Platinum 8167M (aka "Skylake") processor.  The output has been
split into several sections and each section is commented upon
separately.

     Run "gprofng collect app --help" for a usage message.

     Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
     (cpuver=2499):

       -h {auto|lo|on|hi}
     	turn on default set of HW counters at the specified rate
       -h <ctr_def> [-h <ctr_def>]...
       -h <ctr_def>[,<ctr_def>]...
     	specify HW counter profiling for up to 4 HW counters

   The first line shows how to get a usage overview.  This is followed
by some information on the target processor.

   The next five lines explain in what ways the '-h' option can be used
to define the events to be monitored.

   The first version shown above enables a default set of counters.
This default depends on the processor this command is executed on.  The
keyword following the '-h' option defines the sampling rate:

'auto'
     Match the sample rate used by clock profiling.  If the latter is
     disabled, use a per thread sampling rate of approximately 100
     samples per second.  This setting is the default and preferred.

'on'
     Use a per thread sampling rate of approximately 100 samples per
     second.

'lo'
     Use a per thread sampling rate of approximately 10 samples per
     second.

'hi'
     Use a per thread sampling rate of approximately 1000 samples per
     second.

   The second and third variant define the events to be monitored.  Note
that the number of simultaneous events supported is printed.  In this
case we can monitor four events in a single profiling job.

   It is a matter of preference whether you like to use the '-h' option
for each event, or use it once, followed by a comma separated list.

   There is one slight catch though.  The counter definition below has a
mandatory comma (',') between the event and the rate.  While a default
can be used for the rate, the comma cannot be omitted.  This may result
in a somewhat awkward counter definition in case the default sampling
rate is used.

   For example, the following two commands are equivalent.  Note the
double comma in the second command.  This is not a typo.

     $ gprofng collect app -h cycles -h insts ...
     $ gprofng collect app -h cycles,,insts ...

   In the first command this comma is not needed, because a comma
("','") immediately followed by white space may be omitted.

   This is why we prefer this syntax, and in the remainder we will use
the first version of this command.

   The counter definition takes an event name, plus optionally one or
more attributes, followed by a comma, and optionally the sampling rate.
The output section below shows the formal definition.

       <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]

   The printed help then explains this syntax.  Below we have summarized
and expanded this output:

'<ctr>'
     The counter name must be selected from the available counters
     listed as part of the output printed with the '-h' option.  On most
     systems, if a counter is not listed, it may still be specified by
     its numeric value.

'~<attr>=<val>'
     This is an optional attribute that depends on the processor.  The
     list of supported attributes is printed in the output.  Examples of
     attributes are "user", or "system".  The value can be given in
     decimal or hexadecimal format.  Multiple attributes may be
     specified, and each must be preceded by a ~.

'<rate>'

     The sampling rate is one of the following:

     'auto'
          This is the default and matches the rate used by clock
          profiling.  If clock profiling is disabled, use 'on'.

     'on'
          Set the per thread maximum sampling rate to ~100
          samples/second

     'lo'
          Set the per thread maximum sampling rate to ~10 samples/second

     'hi'
          Set the per thread maximum sampling rate to ~1000
          samples/second

     '<interval>'
          Define the sampling interval.  *Note Control the Sampling
          Frequency:: for how to define this.
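
   To make the '<ctr_def>' grammar concrete, here is a hypothetical
Python parser for a single counter definition.  It is only a sketch of
the syntax described above, not code taken from 'gprofng':

```python
# Parse <ctr>[~<attr>=<val>]...,[<rate>] into its three components.
def parse_ctr_def(ctr_def):
    head, _, rate = ctr_def.partition(',')
    name, *attr_parts = head.split('~')
    attrs = {}
    for part in attr_parts:
        key, _, val = part.partition('=')
        attrs[key] = int(val, 0)  # base 0 accepts decimal and 0x... hex
    return name, attrs, rate or 'auto'  # empty rate falls back to the default

print(parse_ctr_def('cycles~user=1,hi'))  # ('cycles', {'user': 1}, 'hi')
print(parse_ctr_def('insts,'))            # ('insts', {}, 'auto')
```

   Note how 'insts,' keeps its mandatory comma even though the rate
itself is omitted, which is exactly the double-comma effect discussed
above.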

   After the section with the formal definition of events and counters,
a processor specific list is displayed.  This part starts with an
overview of the default set of counters and the aliased names supported
_on this specific processor_.

     Default set of HW counters:

         -h cycles,,insts,,llm

     Aliases for most useful HW counters:

      alias    raw name                   type units regs description

      cycles   unhalted-core-cycles   CPU-cycles 0123 CPU Cycles
      insts    instruction-retired        events 0123 Instructions Executed
      llm      llc-misses                 events 0123 Last-Level Cache Misses
      br_msp   branch-misses-retired      events 0123 Branch Mispredict
      br_ins   branch-instruction-retired events 0123 Branch Instructions

   The definitions given above may or may not be available on other
processors, but we try to maximize the overlap across alias sets.

   The table above shows the default set of counters defined for this
processor, and the aliases.  For each alias the full "raw" name is
given, plus the unit of the number returned by the counter (CPU cycles,
or a raw count), the hardware counter the event is allowed to be mapped
onto, and a short description.

   The last part of the output contains all the events that can be
monitored:

     Raw HW counters:

         name                                type      units regs description

         unhalted-core-cycles                     CPU-cycles 0123
         unhalted-reference-cycles                    events 0123
         instruction-retired                          events 0123
         llc-reference                                events 0123
         llc-misses                                   events 0123
         branch-instruction-retired                   events 0123
         branch-misses-retired                        events 0123
         ld_blocks.store_forward                      events 0123
         ld_blocks.no_sr                              events 0123
         ld_blocks_partial.address_alias              events 0123
         dtlb_load_misses.miss_causes_a_walk          events 0123
         dtlb_load_misses.walk_completed_4k           events 0123

         <many lines deleted>

         l2_lines_out.silent                          events 0123
         l2_lines_out.non_silent                      events 0123
         l2_lines_out.useless_hwpf                    events 0123
         sq_misc.split_lock                           events 0123

     See Chapter 19 of the "Intel 64 and IA-32 Architectures Software
     Developer's Manual Volume 3B: System Programming Guide"

   As can be seen, these names are not always easy to correlate with a
specific event of interest.  The processor manual should provide more
clarity on this.


File: gprofng.info,  Node: Examples Using Hardware Event Counters,  Prev: Getting Information on the Counters Supported,  Up: Profile Hardware Event Counters

3.4.2 Examples Using Hardware Event Counters
--------------------------------------------

The previous section may give the impression that these counters are
hard to use, but as we will show now, in practice it is quite simple.

   With the information from the '-h' option, we can easily set up our
first event counter experiment.

   We start by using the default set of counters defined for our
processor and we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2

   The new option here is '-h auto'.  The 'auto' keyword enables
hardware event counter profiling and selects the default set of counters
defined for this processor.

   As before, we can display the information, but there is one practical
hurdle to take.  Unless we want to view all metrics recorded, we would
need to know the names of the events that have been enabled.  This is
tedious and also not portable in case we would like to repeat this
experiment on another processor.

   This is where the special 'hwc' metric comes in very handy.  It
automatically expands to the active set of events used.

   With this, it is very easy to display the event counter values.  Note
that although the regular clock based profiling was enabled, we only
want to see the counter values.  We also request to see the percentages
and limit the output to the first 5 lines:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp

     Current metrics: e.%cycles:e+%insts:e+%llm:name
     Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
     Print limit set to 5
     Functions sorted by metric: Exclusive CPU Cycles

     Excl. CPU     Excl. Instructions  Excl. Last-Level   Name
     Cycles        Executed            Cache Misses
      sec.      %                  %                 %
     2.691 100.00  7906475309 100.00   122658983 100.00   <Total>
     2.598  96.54  7432724378  94.01   121745696  99.26   mxv_core
     0.035   1.31   188860269   2.39       70084   0.06   erand48_r
     0.026   0.95    73623396   0.93      763116   0.62   init_data
     0.018   0.66    76824434   0.97       40040   0.03   drand48
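
   The percentages in this table are simply each function's event count
relative to the '<Total>' row, which can be verified for 'mxv_core':

```python
# Event counts for <Total> and mxv_core, copied from the table above
insts_total, insts_mxv = 7906475309, 7432724378
llm_total, llm_mxv = 122658983, 121745696

print(f"{100 * insts_mxv / insts_total:.2f}")  # 94.01
print(f"{100 * llm_mxv / llm_total:.2f}")      # 99.26
```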
2027
2028   As we have seen before, the first few lines echo the settings.  This
2029includes a list with the hardware event counters used by default.
2030
2031   The table that follows makes it very easy to get an overview where
2032the time is spent and how many of the target events have occurred.
2033
2034   As before, we can drill down deeper and see the same metrics at the
2035source line and instruction level.  Other than using 'hwc' in the
2036metrics definitions, nothing has changed compared to the previous
2037examples:
2038
2039     $ exp=mxv.hwc.def.2.thr.er
2040     $ gprofng display text -metrics e.hwc -source mxv_core $exp
2041
2042   This is the relevant part of the output.  Since the lines get very
2043long, we have somewhat modified the lay-out:
2044
2045        Excl. CPU Excl.        Excl.
2046        Cycles    Instructions Last-Level
2047         sec.     Executed     Cache Misses
2048                                              <Function: mxv_core>
2049        0.                 0          0   32. void __attribute__ ((noinline))
2050                                              mxv_core(...)
2051        0.                 0          0   33. {
2052        0.                 0          0   34.   for (uint64_t i=...) {
2053        0.                 0          0   35.     double row_sum = 0.0;
2054     ## 1.872     7291879319   88150571   36.     for (int64_t j=0; j<n; j++)
2055        0.725      140845059   33595125   37.        row_sum += A[i][j]*b[j];
2056        0.                 0          0   38.     c[i] = row_sum;
2057                                          39.    }
2058        0.                 0          0   40. }
2059
2060   In a smiliar way we can display the event counter values at the
2061instruction level.  Again we have modified the lay-out due to page width
2062limitations:
2063
2064     $ exp=mxv.hwc.def.2.thr.er
2065     $ gprofng display text -metrics e.hwc -disasm mxv_core $exp
2066
2067        Excl. CPU Excl.        Excl.
2068        Cycles    Instructions Last-Level
2069         sec.     Executed     Cache Misses
2070                                                     <Function: mxv_core>
2071        0.                 0          0  [33] 4021ba: mov   0x8(%rsp),%r10
2072                                         34.   for (uint64_t i=...) {
2073        0.                 0          0  [34] 4021bf: cmp   %rsi,%rdi
2074        0.                 0          0  [34] 4021c2: jbe   0x37
2075        0.                 0          0  [34] 4021c4: ret
2076                                         35.       double row_sum = 0.0;
2077                                         36.       for (int64_t j=0; j<n; j++)
2078                                         37.         row_sum += A[i][j]*b[j];
2079        0.                 0          0  [37] 4021c5: mov   (%r8,%rdi,8),%rdx
2080        0.                 0          0  [36] 4021c9: mov   $0x0,%eax
2081        0.                 0          0  [35] 4021ce: pxor  %xmm1,%xmm1
2082        0.002       12804230     321394  [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
2083        0.141       60819025    3866677  [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
2084        0.582       67221804   29407054  [37] 4021dd: addsd %xmm0,%xmm1
2085     ## 1.871     7279075109   87989870  [36] 4021e1: add   $0x1,%rax
2086        0.002       12804210      80351  [36] 4021e5: cmp   %rax,%rcx
2087        0.                 0          0  [36] 4021e8: jne   0xffffffffffffffea
2088                                         38.       c[i] = row_sum;
2089        0.                 0          0  [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
2090        0.                 0          0  [34] 4021f0: add   $0x1,%rdi
2091        0.                 0          0  [34] 4021f4: cmp   %rdi,%rsi
2092        0.                 0          0  [34] 4021f7: jb    0xd
2093        0.                 0          0  [35] 4021f9: pxor  %xmm1,%xmm1
2094        0.                 0          0  [36] 4021fd: test  %rcx,%rcx
2095        0.                 0      80350  [36] 402200: jne   0xffffffffffffffc5
2096        0.                 0          0  [36] 402202: jmp   0xffffffffffffffe8
2097                                         39.   }
2098                                         40. }
2099        0.                 0          0  [40]  402204:  ret
2100
2101   So far we have used the default settings for the event counters.  It
2102is quite straightforward to select specific counters.  For sake of the
2103example, let's assume we would like to count how many branch
2104instructions and retired memory load instructions that missed in the L1
2105cache have been executed.  We also want to count these events with a
2106high resolution.
2107
2108   This is the command to do so:
2109
2110     $ exe=mxv-pthreads.exe
2111     $ m=3000
2112     $ n=2000
2113     $ exp=mxv.hwc.sel.2.thr.er
2114     $ hwc1=br_ins,hi
2115     $ hwc2=mem_load_retired.l1_miss,hi
2116     $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
2117
2118   As before, we get a table with the event counts.  Due to the very
2119long name for the second counter, we have somewhat modified the output.
2120
2121     $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er
2122
2123     Functions sorted by metric: Exclusive Total CPU Time
2124     Excl.     Incl.     Excl. Branch  Excl.                 Name
2125     Total     Total     Instructions  mem_load_retired.l1_miss
2126     CPU sec.  CPU sec.                Events
2127     2.597     2.597     1305305319    4021340               <Total>
2128     2.481     2.481     1233233242    3982327               mxv_core
2129     0.040     0.107       19019012       9003               init_data
2130     0.028     0.052       23023048      15006               erand48_r
2131     0.024     0.024       19019008       9004               __drand48_iterate
2132     0.015     0.067       11011009       2998               drand48
2133     0.008     0.010              0       3002               _int_malloc
2134     0.001     0.001              0          0               brk
2135     0.001     0.002              0          0               sysmalloc
2136     0.        0.001              0          0               __default_morecore
2137
2138   When using event counters, the values could be very large and it is
2139not easy to compare the numbers.  As we will show next, the 'ratio'
2140feature is very useful when comparing such profiles.
2141
2142   To demonstrate this, we have set up another event counter experiment
2143where we would like to compare the number of last level cache miss and
2144the number of branch instructions executed when using a single thread,
2145or two threads.
2146
2147   These are the commands used to generate the experiment directories:
2148
2149     $ exe=./mxv-pthreads.exe
2150     $ m=3000
2151     $ n=2000
2152     $ exp1=mxv.hwc.comp.1.thr.er
2153     $ exp2=mxv.hwc.comp.2.thr.er
2154     $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
2155     $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2
2156
2157   The following script has been used to get the tables.  Due to lay-out
2158restrictions, we have to create two tables, one for each counter.
2159
2160     # Limit the output to 5 lines
2161     limit 5
2162     # Define the metrics
2163     metrics name:e.llm
2164     # Set the comparison to ratio
2165     compare ratio
2166     functions
2167     # Define the metrics
2168     metrics name:e.br_ins
2169     # Set the comparison to ratio
2170     compare ratio
2171     functions
2172
2173   Note that we print the name of the function first, followed by the
2174counter data.  The new element is that we set the comparison mode to
2175'ratio'.  This divides the data in a column by its counterpart in the
2176reference experiment.
2177
2178   This is the command using this script and the two experiment
2179directories as input:
2180
2181     $ gprofng display text -script my-script-comp-counters \
2182       mxv.hwc.comp.1.thr.er \
2183       mxv.hwc.comp.2.thr.er
2184
2185   By design, we get two tables, one for each counter:
2186
2187     Functions sorted by metric: Exclusive Last-Level Cache Misses
2188
2189                                   mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
2190     Name                          Excl. Last-Level       Excl. Last-Level
2191                                   Cache Misses           Cache Misses
2192                                                              ratio
2193      <Total>                      122709276              x   0.788
2194      mxv_core                     121796001              x   0.787
2195      init_data                       723064              x   1.055
2196      erand48_r                       100111              x   0.500
2197      drand48                          60065              x   1.167
2198
2199     Functions sorted by metric: Exclusive Branch Instructions
2200
2201                                   mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
2202     Name                          Excl. Branch           Excl. Branch
2203                                   Instructions           Instructions
2204                                                            ratio
2205      <Total>                      1307307316             x 0.997
2206      mxv_core                     1235235239             x 0.997
2207      erand48_r                      23023033             x 0.957
2208      drand48                        20020009             x 0.600
2209      __drand48_iterate              17017028             x 0.882
2210
   A ratio less than one in the second column means that this counter
value was smaller than the value from the reference experiment shown in
the first column.
2214
2215   This kind of presentation of the results makes it much easier to
2216quickly interpret the data.
2217
2218   We conclude this section with thread-level event counter overviews,
2219but before we go into this, there is an important metric we need to
2220mention.
2221
   In case it is known how many instructions and CPU cycles have been
executed, the value for the IPC ("Instructions Per Clockcycle") can be
computed.  *Note Hardware Event Counters Explained::.  This is a derived
metric that gives an indication of how well the processor is utilized.
The inverse of the IPC is called the CPI.
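
Since the IPC and CPI are simple derived quantities, they are easy to
compute by hand from the raw counts.  A minimal sketch, in which the
counter values are invented:

```python
# IPC = instructions / cycles, and CPI = cycles / instructions, so the
# CPI is the inverse of the IPC.  The counts below are invented; real
# values come from the experiment data.

def ipc(instructions, cycles):
    """Instructions Per Clockcycle: instructions retired per cycle."""
    return instructions / cycles

def cpi(instructions, cycles):
    """Cycles Per Instruction: the inverse of the IPC."""
    return cycles / instructions

insts, cycles = 7_432_724_378, 5_182_000_000   # invented counts
low_ipc = ipc(insts, cycles)                   # ~1.43 for these counts
```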
2227
2228   The 'gprofng display text' command automatically computes the IPC and
2229CPI values if an experiment contains the event counter values for the
2230instructions and CPU cycles executed.  These are part of the metric list
2231and can be displayed, just like any other metric.
2232
2233   This can be verified through the 'metric_list' command.  If we go
2234back to our earlier experiment with the default event counters, we get
2235the following result.
2236
2237     $ gprofng display text -metric_list mxv.hwc.def.2.thr.er
2238
2239     Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
2240     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
2241     Available metrics:
2242              Exclusive Total CPU Time: e.%totalcpu
2243              Inclusive Total CPU Time: i.%totalcpu
2244                  Exclusive CPU Cycles: e.+%cycles
2245                  Inclusive CPU Cycles: i.+%cycles
2246       Exclusive Instructions Executed: e+%insts
2247       Inclusive Instructions Executed: i+%insts
2248     Exclusive Last-Level Cache Misses: e+%llm
2249     Inclusive Last-Level Cache Misses: i+%llm
2250      Exclusive Instructions Per Cycle: e+IPC
2251      Inclusive Instructions Per Cycle: i+IPC
2252      Exclusive Cycles Per Instruction: e+CPI
2253      Inclusive Cycles Per Instruction: i+CPI
2254                                  Size: size
2255                            PC Address: address
2256                                  Name: name
2257
2258   Among the other metrics, we see the new metrics for the IPC and CPI
2259listed.
2260
   In the script below, we use this information and add the IPC and CPI
to the metrics to be displayed.  We also use the thread filter to
display these values for the individual threads.
2264
2265   This is the complete script we have used.  Other than a different
2266selection of the metrics, there are no new features.
2267
2268     # Define the metrics
2269     metrics e.insts:e.%cycles:e.IPC:e.CPI
2270     # Sort with respect to cycles
2271     sort e.cycles
2272     # Limit the output to 5 lines
2273     limit 5
2274     # Get the function overview for all threads
2275     functions
2276     # Get the function overview for thread 1
2277     thread_select 1
2278     functions
2279     # Get the function overview for thread 2
2280     thread_select 2
2281     functions
2282     # Get the function overview for thread 3
2283     thread_select 3
2284     functions
2285
2286   In the metrics definition on the second line, we explicitly request
2287the counter values for the instructions ('e.insts') and CPU cycles
('e.cycles') executed.  These names can be found in the output of the
'metric_list' command above.  In addition to these metrics, we also
2290request the IPC and CPI to be shown.
2291
2292   As before, we used the 'limit' command to control the number of
2293functions displayed.  We then request an overview for all the threads,
2294followed by three sets of two commands to select a thread and display
2295the function overview.
2296
2297   The script above is used as follows:
2298
2299     $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er
2300
2301   This script produces four tables.  We list them separately below, and
2302have left out the additional output.
2303
2304   The first table shows the accumulated values across the three threads
2305that have been active.
2306
2307     Functions sorted by metric: Exclusive CPU Cycles
2308
2309     Excl.         Excl. CPU     Excl.  Excl.   Name
2310     Instructions  Cycles        IPC    CPI
2311     Executed       sec.      %
2312     7906475309    2.691 100.00  1.473  0.679   <Total>
2313     7432724378    2.598  96.54  1.434  0.697   mxv_core
2314      188860269    0.035   1.31  2.682  0.373   erand48_r
2315       73623396    0.026   0.95  1.438  0.696   init_data
2316       76824434    0.018   0.66  2.182  0.458   drand48
2317
   This shows that the IPC of this program is completely dominated by
function 'mxv_core'.  It has a fairly low IPC value of 1.43.
2320
2321   The next table is for thread 1 and shows the values for the main
2322thread.
2323
2324     Exp Sel Total
2325     === === =====
2326       1 1       3
2327     Functions sorted by metric: Exclusive CPU Cycles
2328
2329     Excl.         Excl. CPU     Excl.  Excl.   Name
2330     Instructions  Cycles        IPC    CPI
2331     Executed       sec.      %
2332     473750931     0.093 100.00  2.552  0.392   <Total>
2333     188860269     0.035  37.93  2.682  0.373   erand48_r
2334      73623396     0.026  27.59  1.438  0.696   init_data
2335      76824434     0.018  18.97  2.182  0.458   drand48
2336     134442832     0.013  13.79  5.250  0.190   __drand48_iterate
2337
2338   Although this thread hardly uses any CPU cycles, the overall IPC of
23392.55 is not all that bad.
2340
2341   Last, we show the tables for threads 2 and 3:
2342
2343     Exp Sel Total
2344     === === =====
2345       1 2       3
2346     Functions sorted by metric: Exclusive CPU Cycles
2347
2348     Excl.         Excl. CPU     Excl.  Excl.   Name
2349     Instructions  Cycles        IPC    CPI
2350     Executed       sec.      %
2351     3716362189    1.298 100.00  1.435  0.697   <Total>
2352     3716362189    1.298 100.00  1.435  0.697   mxv_core
2353              0    0.      0.    0.     0.      collector_root
2354              0    0.      0.    0.     0.      driver_mxv
2355
2356     Exp Sel Total
2357     === === =====
2358       1 3       3
2359     Functions sorted by metric: Exclusive CPU Cycles
2360
2361     Excl.         Excl. CPU     Excl.  Excl.   Name
2362     Instructions  Cycles        IPC    CPI
2363     Executed       sec.      %
2364     3716362189    1.300 100.00  1.433  0.698   <Total>
2365     3716362189    1.300 100.00  1.433  0.698   mxv_core
2366              0    0.      0.    0.     0.      collector_root
2367              0    0.      0.    0.     0.      driver_mxv
2368
   Both threads execute the same number of instructions and take about
the same number of CPU cycles.  As a result, the IPC is the same for
both threads.
2372
2373
2374File: gprofng.info,  Node: Java Profiling,  Prev: Profile Hardware Event Counters,  Up: A Mini Tutorial
2375
23763.5 Java Profiling
2377==================
2378
2379The 'gprofng collect app' command supports Java profiling.  The '-j on'
2380option can be used for this, but since this feature is enabled by
2381default, there is no need to set this explicitly.  Java profiling may be
2382disabled through the '-j off' option.
2383
2384   The program is compiled as usual and the experiment directory is
2385created similar to what we have seen before.  The only difference with a
2386C/C++ application is that the program has to be explicitly executed by
2387java.
2388
2389   For example, this is how to generate the experiment data for a Java
2390program that has the source code stored in file 'Pi.java':
2391
2392     $ javac Pi.java
2393     $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in
2394
2395   Regarding which java is selected to generate the data, 'gprofng'
2396first looks for the JDK in the path set in either the 'JDK_HOME'
2397environment variable, or in the 'JAVA_PATH' environment variable.  If
2398neither of these variables is set, it checks for a JDK in the search
2399path (set in the PATH environment variable).  If there is no JDK in this
2400path, it checks for the java executable in '/usr/java/bin/java'.
2401
2402   In case additional options need to be passed on to the JVM, the '-J
2403<string>' option can be used.  The string with the option(s) has to be
2404delimited by quotation marks in case there is more than one argument.
2405
2406   The 'gprofng display text' command may be used to view the
2407performance data.  There is no need for any special options and the same
2408commands as previously discussed are supported.
2409
   The 'viewmode' command (*note The Viewmode::) is very useful to
examine the call stacks.
2412
   For example, this is how one can see the native call stacks.  For
layout purposes we have restricted the list to the first five entries:
2415
2416     $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er
2417
2418     Print limit set to 5
2419     Viewmode set to machine
2420     Functions Call Tree. Metric: Attributed Total CPU Time
2421
2422     Attr.      Name
2423     Total
2424     CPU sec.
2425     1.381      +-<Total>
2426     1.171        +-Pi.calculatePi(double)
2427     0.110        +-collector_root
2428     0.110        |  +-JavaMain
2429     0.070        |    +-jni_CallStaticVoidMethod
2430
2431Note that the selection of the viewmode is echoed in the output.
2432
2433
2434File: gprofng.info,  Node: Terminology,  Next: Other Document Formats,  Prev: A Mini Tutorial,  Up: Top
2435
24364 Terminology
2437*************
2438
2439Throughout this manual, certain terminology specific to profiling tools,
2440or 'gprofng', or even to this document only, is used.  In this chapter
2441we explain this terminology in detail.
2442
2443* Menu:
2444
2445* The Program Counter::                    What is a Program Counter?
2446* Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
2447* Metric Definitions::                     Definitions associated with metrics.
2448* The Viewmode::                           Select the way call stacks are presented.
2449* The Selection List::                     How to define a selection.
2450* Load Objects and Functions::             The components in an application.
2451* The Concept of a CPU in gprofng:: The definition of a CPU.
2452* Hardware Event Counters Explained::      What are event counters?
2453* apath::                                  Our generic definition of a path.
2454
2455
2456File: gprofng.info,  Node: The Program Counter,  Next: Inclusive and Exclusive Metrics,  Up: Terminology
2457
24584.1 The Program Counter
2459=======================
2460
The _Program Counter_, or PC for short, keeps track of where program
2462execution is.  The address of the next instruction to be executed is
2463stored in a special purpose register in the processor, or core.
2464
2465   The PC is sometimes also referred to as the _instruction pointer_,
2466but we will use Program Counter or PC throughout this document.
2467
2468
2469File: gprofng.info,  Node: Inclusive and Exclusive Metrics,  Next: Metric Definitions,  Prev: The Program Counter,  Up: Terminology
2470
24714.2 Inclusive and Exclusive Metrics
2472===================================
2473
2474In the remainder, these two concepts occur quite often and for lack of a
2475better place, they are explained here.
2476
2477   The _inclusive_ value for a metric includes all values that are part
of the dynamic extent of the target function.  For example, if function
2479'A' calls functions 'B' and 'C', the inclusive CPU time for 'A' includes
2480the CPU time spent in 'B' and 'C'.
2481
   In contrast with this, the _exclusive_ value for a metric is computed
by excluding the metric values used by the functions it calls.  In our
2484imaginary example, the exclusive CPU time for function 'A' is the time
2485spent outside calling functions 'B' and 'C'.
2486
2487   In case of a _leaf function_, the inclusive and exclusive values for
2488the metric are the same since by definition, it is not calling any other
2489function(s).
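
These definitions can be made concrete with a small invented example:
if function 'A' has an inclusive CPU time of 10 seconds and its callees
'B' and 'C' account for 3 and 2 seconds, the exclusive time of 'A' is 5
seconds.  A sketch of this arithmetic, assuming each callee has a
single caller so a plain subtraction is valid:

```python
# Invented numeric example of inclusive vs. exclusive metrics.
# Assumes every callee has exactly one caller; real profilers
# attribute times per call stack, which is more involved.

inclusive = {"A": 10.0, "B": 3.0, "C": 2.0}   # inclusive CPU time (s)
calls = {"A": ["B", "C"], "B": [], "C": []}   # who calls whom

def exclusive(func):
    """Exclusive = inclusive minus the callees' inclusive times."""
    return inclusive[func] - sum(inclusive[c] for c in calls[func])

# For the leaf functions B and C, exclusive == inclusive.
```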
2490
2491   Why do we use these two different values?  The inclusive metric shows
2492the most expensive path, in terms of this metric, in the application.
2493For example, if the metric is cache misses, the function with the
2494highest inclusive metric tells you where most of the cache misses come
2495from.
2496
2497   Within this branch of the application, the exclusive metric points to
2498the functions that contribute and help to identify which part(s) to
2499consider for further analysis.
2500
2501
2502File: gprofng.info,  Node: Metric Definitions,  Next: The Viewmode,  Prev: Inclusive and Exclusive Metrics,  Up: Terminology
2503
25044.3 Metric Definitions
2505======================
2506
2507The metrics to be shown are highly customizable.  In this section we
2508explain the definitions associated with metrics.
2509
2510   The 'metrics' command takes a colon (:) separated list with special
keywords.  Each keyword consists of the following three fields:
2512'<flavor>''<visibility>''<metric_name>'.
2513
2514   The _<flavor>_ field is either an 'e' for "exclusive", or 'i' for
2515"inclusive".  The '<metric_name>' field is the name of the metric
request.  The _<visibility>_ field consists of one or more characters
2517from the following table:
2518
2519'.'
2520     Show the metric as time.  This applies to timing metrics and
2521     hardware event counters that measure cycles.  Interpret as '+' for
2522     other metrics.
2523
2524'%'
2525     Show the metric as a percentage of the total value for this metric.
2526
2527'+'
2528     Show the metric as an absolute value.  For hardware event counters
2529     this is the event count.  Interpret as '.' for timing metrics.
2530
2531'|'
2532     Do not show any metric value.  Cannot be used with other visibility
2533     characters.
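
To illustrate how such a keyword decomposes into its three fields, here
is a small parser of our own.  It is not gprofng code, merely a mirror
of the description above:

```python
# Illustrative parser for a metric keyword of the form
# <flavor><visibility><metric_name>, e.g. "e.%totalcpu".
# gprofng's internal handling may differ.

VISIBILITY_CHARS = set(".%+|")

def parse_metric(keyword):
    flavor = keyword[0]           # 'e' = exclusive, 'i' = inclusive
    if flavor not in ("e", "i"):
        raise ValueError("flavor must be 'e' or 'i'")
    i = 1
    while i < len(keyword) and keyword[i] in VISIBILITY_CHARS:
        i += 1
    visibility, name = keyword[1:i], keyword[i:]
    return flavor, visibility, name
```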
2534
2535
2536File: gprofng.info,  Node: The Viewmode,  Next: The Selection List,  Prev: Metric Definitions,  Up: Terminology
2537
25384.4 The Viewmode
2539================
2540
2541There are different ways to view a call stack in Java.  In 'gprofng',
2542this is called the _viewmode_ and the setting is controlled through a
2543command with the same name.
2544
2545   The 'viewmode' command takes one of the following keywords:
2546
2547'user'
2548     This is the default and shows the Java call stacks for Java
2549     threads.  No call stacks for any housekeeping threads are shown.
2550     The function list contains a function '<JVM-System>' that
2551     represents the aggregated time from non-Java threads.  When the JVM
2552     software does not report a Java call stack, time is reported
2553     against the function '<no Java callstack recorded>'.
2554
2555'expert'
2556     Show the Java call stacks for Java threads when the Java code from
2557     the user is executed and machine call stacks when JVM code is
2558     executed, or when the JVM software does not report a Java call
2559     stack.  Show the machine call stacks for housekeeping threads.
2560
2561'machine'
2562     Show the actual native call stacks for all threads.
2563
2564
2565File: gprofng.info,  Node: The Selection List,  Next: Load Objects and Functions,  Prev: The Viewmode,  Up: Terminology
2566
25674.5 The Selection List
2568======================
2569
2570Several commands allow the user to specify a subset of a list.  For
2571example, to select specific threads from all the threads that have been
2572used when conducting the experiment(s).
2573
2574   Such a selection list (or "list" in the remainder of this section)
2575can be a single number, a contiguous range of numbers with the start and
2576end numbers separated by a hyphen ('-'), a comma-separated list of
2577numbers and ranges, or the 'all' keyword.  Lists must not contain
2578spaces.
2579
2580   Each list can optionally be preceded by an experiment list with a
2581similar format, separated from the list by a colon (:).  If no
2582experiment list is included, the list applies to all experiments.
2583
2584   Multiple lists can be concatenated by separating the individual lists
2585by a plus sign.
2586
2587   These are some examples of various filters using a list:
2588
2589'thread_select 1'
2590     Select thread 1 from all experiments.
2591
2592'thread_select all:1'
2593     Select thread 1 from all experiments.
2594
2595'thread_select 1:1+2:2'
2596     Select thread 1 from experiment 1 and thread 2 from experiment 2.
2597
2598'cpu_select all:1,3,5'
     Select cores 1, 3, and 5 from all experiments.
2600
2601'cpu_select 1,2:all'
     Select all cores from experiments 1 and 2, as listed by the
     'exp_list' command.
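
The grammar above is small enough to sketch a parser for.  The code
below is our own illustration of how such a specification could be
decomposed, not gprofng's actual implementation:

```python
# Parse a selection list such as "1-3,5", "all", or "1:1+2:2".
# Illustrative only; gprofng's own syntax handling may differ.

def parse_ranges(text):
    """'1-3,5' -> {1, 2, 3, 5}; 'all' stays the keyword 'all'."""
    if text == "all":
        return "all"
    result = set()
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            result.update(range(int(lo), int(hi) + 1))
        else:
            result.add(int(part))
    return result

def parse_selection(spec):
    """Return (experiments, items) pairs for a spec like '1:1+2:2'."""
    pairs = []
    for chunk in spec.split("+"):
        if ":" in chunk:
            exps, items = chunk.split(":")
        else:
            exps, items = "all", chunk   # no experiment list given
        pairs.append((parse_ranges(exps), parse_ranges(items)))
    return pairs
```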
2604
2605
2606File: gprofng.info,  Node: Load Objects and Functions,  Next: The Concept of a CPU in gprofng,  Prev: The Selection List,  Up: Terminology
2607
26084.6 Load Objects and Functions
2609==============================
2610
2611An application consists of various components.  The source code files
2612are compiled into object files.  These are then glued together at link
2613time to form the executable.  During execution, the program may also
2614dynamically load objects.
2615
2616   A _load object_ is defined to be an executable, or shared object.  A
2617shared library is an example of a load object in 'gprofng'.
2618
   Each load object contains a text section with the instructions
2620generated by the compiler, a data section for data, and various symbol
2621tables.  All load objects must contain an ELF symbol table, which gives
2622the names and addresses of all the globally known functions in that
2623object.
2624
   Load objects compiled with the '-g' option contain additional symbolic
2626information that can augment the ELF symbol table and provide
2627information about functions that are not global, additional information
2628about object modules from which the functions came, and line number
2629information relating addresses to source lines.
2630
2631   The term _function_ is used to describe a set of instructions that
2632represent a high-level operation described in the source code.  The term
2633also covers methods as used in C++ and in the Java programming language.
2634
2635   In the 'gprofng' context, functions are provided in source code
2636format.  Normally their names appear in the symbol table representing a
2637set of addresses.  If the Program Counter (PC) is within that set, the
2638program is executing within that function.
2639
2640   In principle, any address within the text segment of a load object
2641can be mapped to a function.  Exactly the same mapping is used for the
2642leaf PC and all the other PCs on the call stack.
2643
2644   Most of the functions correspond directly to the source model of the
2645program, but there are exceptions.  This topic is however outside of the
2646scope of this guide.
2647
2648
2649File: gprofng.info,  Node: The Concept of a CPU in gprofng,  Next: Hardware Event Counters Explained,  Prev: Load Objects and Functions,  Up: Terminology
2650
26514.7 The Concept of a CPU in gprofng
2652===================================
2653
2654In gprofng, there is the concept of a CPU. Admittedly, this is not the
2655best word to describe what is meant here and may be replaced in the
2656future.
2657
2658   The word CPU is used in many of the displays.  In the context of
2659gprofng, it is meant to denote a part of the processor that is capable
2660of executing instructions and with its own state, like the program
2661counter.
2662
2663   For example, on a contemporary processor, a CPU could be a core.  In
2664case hardware threads are supported within a core, it could be one of
2665those hardware threads.
2666
2667
2668File: gprofng.info,  Node: Hardware Event Counters Explained,  Next: apath,  Prev: The Concept of a CPU in gprofng,  Up: Terminology
2669
26704.8 Hardware Event Counters Explained
2671=====================================
2672
2673For quite a number of years now, many microprocessors have supported
2674hardware event counters.
2675
2676   On the hardware side, this means that in the processor there are one
2677or more registers dedicated to count certain activities, or "events".
2678Examples of such events are the number of instructions executed, or the
2679number of cache misses at level 2 in the memory hierarchy.
2680
2681   While there is a limited set of such registers, the user can map
2682events onto them.  In case more than one register is available, this
allows for the simultaneous measurement of various events.
2684
2685   A simple, yet powerful, example is to simultaneously count the number
of CPU cycles and the number of instructions executed.  These two numbers
2687can then be used to compute the _IPC_ value.  IPC stands for
2688"Instructions Per Clockcycle" and each processor has a maximum.  For
2689example, if this maximum number is 2, it means the processor is capable
2690of executing two instructions every clock cycle.
2691
2692   Whether this is actually achieved, depends on several factors,
2693including the instruction characteristics.  However, in case the IPC
2694value is well below this maximum in a time critical part of the
2695application and this cannot be easily explained, further investigation
2696is probably warranted.
2697
2698   A related metric is called _CPI_, or "Clockcycles Per Instruction".
It is the inverse of the IPC and can be compared against the theoretical
2700value(s) of the target instruction(s).  A significant difference may
2701point at a bottleneck.
2702
2703   One thing to keep in mind is that the value returned by a counter can
either be the number of times the event occurred, or a CPU cycle count.
2705In case of the latter it is possible to convert this number to time.
2706
2707   This is often easier to interpret than a simple count, but there is
2708one caveat to keep in mind.  The CPU frequency may not have been
constant while the experiment was recorded and this impacts the time
2710reported.
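
Converting such a cycle count to time is a plain division by the clock
frequency, which, as noted, is only meaningful if the frequency was
constant during the run.  A minimal sketch; the 2.5 GHz figure is
invented:

```python
# Convert a CPU cycle count to seconds, assuming a *constant* clock
# frequency.  If the frequency varied while the experiment was
# recorded, this estimate is off.  The values below are invented.

def cycles_to_seconds(cycles, frequency_hz):
    return cycles / frequency_hz

seconds = cycles_to_seconds(5_000_000_000, 2.5e9)   # 2.0 seconds
```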
2711
2712   These event counters, or "counters" for short, provide great insight
2713into what happens deep inside the processor.  In case higher level
2714information does not provide the insight needed, the counters provide
2715the information to get to the bottom of a performance problem.
2716
2717   There are some things to consider though.
2718
2719   * The event definitions and names vary across processors and it may
     even happen that some events change with an update.  Unfortunately,
     although this is luckily rare, there are sometimes bugs that cause
     the wrong count to be returned.
2723
2724     In 'gprofng', some of the processor specific event names have an
2725     alias name.  For example 'insts' measures the instructions
2726     executed.  These aliases not only makes it easier to identify the
2727     functionality, but also provide portability of certain events
2728     across processors.
2729
2730   * Another complexity is that there are typically many events one can
     monitor.  There may be up to hundreds of events available and it could
2732     require several experiments to zoom in on the root cause of a
2733     performance problem.
2734
2735   * There may be restrictions regarding the mapping of event(s) onto
2736     the counters.  For example, certain events may be restricted to
2737     specific counters only.  As a result, one may have to conduct
2738     additional experiments to cover all the events of interest.
2739
2740   * The names of the events may also not be easy to interpret.  In such
2741     cases, the description can be found in the architecture manual for
2742     the processor.
2743
2744   Despite these drawbacks, hardware event counters are extremely useful
2745and may even turn out to be indispensable.
2746
2747
2748File: gprofng.info,  Node: apath,  Prev: Hardware Event Counters Explained,  Up: Terminology
2749
27504.9 What is <apath>?
2751====================
2752
2753In most cases, 'gprofng' shows the absolute pathnames of directories.
2754These tend to be rather long, causing display issues in this document.
2755
2756   Instead of wrapping these long pathnames over multiple lines, we
2757decided to represent them by the '<apath>' symbol, which stands for "an
2758absolute pathname".
2759
2760   Note that different occurrences of '<apath>' may represent different
2761absolute pathnames.
2762
2763
2764File: gprofng.info,  Node: Other Document Formats,  Next: Index,  Prev: Terminology,  Up: Top
2765
27665 Other Document Formats
2767************************
2768
2769This document is written in Texinfo and the source text is made
available as part of the binutils distribution.  The file is called
2771'gprofng.texi' and can be found in subdirectory 'doc' under directory
2772'gprofng' in the top level directory.
2773
2774   This file can be used to generate the document in the 'info', 'html',
2775and 'pdf' formats.  The default installation procedure creates a file in
2776the 'info' format and stores it in the documentation section of
2777binutils.
2778
   Probably the easiest way to generate a different format from this
2780Texinfo document is to go to the distribution directory that was created
2781when the tools were built.  This is either the default distribution
2782directory, or the one that has been set with the '--prefix' option as
2783part of the 'configure' command.  In this example we symbolize this
2784location with '<dist>'.
2785
2786   The make file called 'Makefile' in directory '<dist>/gprofng/doc'
2787supports several commands to generate this document in different
formats.  We recommend using these commands.
2789
   They create the file(s) and install them in the documentation directory
2791of binutils, which is '<dist>/share/doc' in case 'html' or 'pdf' is
2792selected and '<dist>/share/info' for the file in the 'info' format.
2793
2794   To generate this document in the requested format and install it in
2795the documentation directory, the commands below should be executed.  In
2796this notation, '<format>' is one of 'info', 'html', or 'pdf':
2797
2798     $ cd <dist>/gprofng/doc
2799     $ make install-<format>
2800
2801Some things to note:
2802
2803   * For the 'pdf' file to be generated, the TeX document formatting
     software is required and the relevant commands need to be included
2805     in the search path.  An example of a popular TeX implementation is
2806     _TexLive_.  It is beyond the scope of this document to go into the
2807     details of installing and using TeX, but it is well documented
2808     elsewhere.
2809
2810   * Instead of generating a single file in the 'html' format, it is
2811     also possible to create a directory with individual files for the
2812     various chapters.  To do so, remove the use of '--no-split' in
2813     variable 'MAKEINFOHTML' in the make file in the 'doc' directory.
2814
   * The make file also supports commands to only generate the files in
     the desired format and not move them to the documentation
2817     directory.  This is accomplished through the 'make <format>'
2818     command.
2819
2820
2821File: gprofng.info,  Node: Index,  Prev: Other Document Formats,  Up: Top
2822
2823Index
2824*****
2825
[index]
2827* Menu:
2828
2829* Command line mode:                     A First Profile.     (line  39)
2830* Commands, calltree:                    The Call Tree.       (line  10)
2831* Commands, compare delta:               Comparison of Experiments.
2832                                                              (line  52)
2833* Commands, compare on/off:              Comparison of Experiments.
2834                                                              (line   7)
2835* Commands, compare on/off <1>:          Comparison of Experiments.
2836                                                              (line  51)
2837* Commands, compare ratio:               Comparison of Experiments.
2838                                                              (line  52)
2839* Commands, compare ratio <1>:           Examples Using Hardware Event Counters.
2840                                                              (line 166)
2841* Commands, cpus:                        Commands Specific to Multithreading.
2842                                                              (line 177)
2843* Commands, cpu_list:                    Commands Specific to Multithreading.
2844                                                              (line 176)
2845* Commands, disasm:                      The Disassembly View.
2846                                                              (line   8)
2847* Commands, experiment_list:             Aggregation of Experiments.
                                                              (line  21)
* Commands, fsingle:                     Information on Load Objects.
                                                              (line  10)
* Commands, fsingle <1>:                 Information on Load Objects.
                                                              (line  36)
* Commands, fsummary:                    Information on Load Objects.
                                                              (line  10)
* Commands, fsummary <1>:                Information on Load Objects.
                                                              (line  62)
* Commands, functions:                   A First Profile.     (line  44)
* Commands, header:                      More Information on the Experiment.
                                                              (line  10)
* Commands, header <1>:                  Control the Sampling Frequency.
                                                              (line  59)
* Commands, limit:                       Control the Number of Lines in the Output.
                                                              (line   6)
* Commands, lines:                       The Source Code View.
                                                              (line  68)
* Commands, metrics:                     Display and Define the Metrics.
                                                              (line  11)
* Commands, metrics <1>:                 Display and Define the Metrics.
                                                              (line  38)
* Commands, metrics <2>:                 Metric Definitions.  (line   9)
* Commands, metric_list:                 Display and Define the Metrics.
                                                              (line  10)
* Commands, metric_list <1>:             Display and Define the Metrics.
                                                              (line  19)
* Commands, metric_list <2>:             Examples Using Hardware Event Counters.
                                                              (line 261)
* Commands, objects:                     Information on Load Objects.
                                                              (line  10)
* Commands, overview:                    More Information on the Experiment.
                                                              (line  54)
* Commands, pcs:                         The Disassembly View.
                                                              (line  73)
* Commands, script:                      Scripting.           (line  11)
* Commands, sort:                        Sorting the Performance Data.
                                                              (line   6)
* Commands, source:                      The Source Code View.
                                                              (line  12)
* Commands, threads:                     Commands Specific to Multithreading.
                                                              (line  35)
* Commands, thread_list:                 Commands Specific to Multithreading.
                                                              (line  10)
* Commands, thread_select:               Commands Specific to Multithreading.
                                                              (line  64)
* Commands, thread_select <1>:           Commands Specific to Multithreading.
                                                              (line  88)
* Commands, viewmode:                    Java Profiling.      (line  37)
* Commands, viewmode <1>:                The Viewmode.        (line   6)
* Compare experiments:                   Comparison of Experiments.
                                                              (line  10)
* CPI:                                   Hardware Event Counters Explained.
                                                              (line  31)
* CPU:                                   The Concept of a CPU in gprofng.
                                                              (line   6)
* Default metrics:                       Display and Define the Metrics.
                                                              (line  34)
* ELF:                                   Load Objects and Functions.
                                                              (line  16)
* Exclusive metric:                      Inclusive and Exclusive Metrics.
                                                              (line  14)
* Experiment directory:                  Steps Needed to Create a Profile.
                                                              (line  21)
* Filters, Thread selection:             Commands Specific to Multithreading.
                                                              (line  64)
* Flavor field:                          Metric Definitions.  (line  13)
* Function:                              Load Objects and Functions.
                                                              (line  26)
* gprofng display html:                  Steps Needed to Create a Profile.
                                                              (line  46)
* gprofng display text:                  Steps Needed to Create a Profile.
                                                              (line  42)
* Hardware event counters, alias name:   Hardware Event Counters Explained.
                                                              (line  57)
* Hardware event counters, auto option:  Examples Using Hardware Event Counters.
                                                              (line  21)
* Hardware event counters, counter definition: Getting Information on the Counters Supported.
                                                              (line  85)
* Hardware event counters, description:  Hardware Event Counters Explained.
                                                              (line   6)
* Hardware event counters, hwc metric:   Examples Using Hardware Event Counters.
                                                              (line  31)
* Hardware event counters, IPC:          Examples Using Hardware Event Counters.
                                                              (line 250)
* Hardware event counters, variable CPU frequency: Hardware Event Counters Explained.
                                                              (line  40)
* Inclusive metric:                      Inclusive and Exclusive Metrics.
                                                              (line   9)
* Instruction level timings:             The Disassembly View.
                                                              (line   9)
* Instruction pointer:                   The Program Counter. (line  10)
* Interpreter mode:                      A First Profile.     (line  30)
* IPC:                                   Hardware Event Counters Explained.
                                                              (line  20)
* Java profiling, -J <string>:           Java Profiling.      (line  29)
* Java profiling, -j on/off:             Java Profiling.      (line   6)
* Java profiling, <JVM-System>:          The Viewmode.        (line  15)
* Java profiling, <no Java callstack recorded>: The Viewmode. (line  18)
* Java profiling, different view modes:  Java Profiling.      (line  37)
* Java profiling, JAVA_PATH:             Java Profiling.      (line  24)
* Java profiling, JDK_HOME:              Java Profiling.      (line  23)
* Leaf function:                         Inclusive and Exclusive Metrics.
                                                              (line  19)
* List specification:                    The Selection List.  (line   6)
* Load object:                           Load Objects and Functions.
                                                              (line  11)
* Load objects:                          Information on Load Objects.
                                                              (line  11)
* Metric name field:                     Metric Definitions.  (line  13)
* Miscellaneous, ##:                     The Source Code View.
                                                              (line  63)
* Miscellaneous, <apath>:                Information on Load Objects.
                                                              (line  16)
* Miscellaneous, <Total>:                A First Profile.     (line  86)
* mxv-pthreads.exe:                      The Example Program. (line  12)
* Options, -C:                           More Information on the Experiment.
                                                              (line  48)
* Options, -h:                           Getting Information on the Counters Supported.
                                                              (line   9)
* Options, -h <1>:                       Examples Using Hardware Event Counters.
                                                              (line  21)
* Options, -o:                           Name the Experiment Directory.
                                                              (line  15)
* Options, -O:                           Name the Experiment Directory.
                                                              (line  19)
* Options, -O <1>:                       A More Elaborate Example.
                                                              (line  12)
* Options, -p:                           The Call Tree.       (line  66)
* Options, -p <1>:                       Control the Sampling Frequency.
                                                              (line  18)
* PC:                                    The Program Counter. (line   6)
* PC <1>:                                Load Objects and Functions.
                                                              (line  32)
* PC sampling:                           Sampling versus Tracing.
                                                              (line   7)
* Posix Threads:                         The Example Program. (line   8)
* Program Counter:                       The Program Counter. (line   6)
* Program Counter <1>:                   Load Objects and Functions.
                                                              (line  32)
* Program Counter sampling:              Sampling versus Tracing.
                                                              (line   7)
* Pthreads:                              The Example Program. (line   8)
* Sampling interval:                     Control the Sampling Frequency.
                                                              (line  20)
* Selection list:                        The Selection List.  (line   6)
* Source level timings:                  The Source Code View.
                                                              (line  10)
* TeX:                                   Other Document Formats.
                                                              (line  40)
* Thread affinity:                       Commands Specific to Multithreading.
                                                              (line 162)
* Total CPU time:                        A First Profile.     (line  74)
* Viewmode:                              The Viewmode.        (line   6)
* Visibility field:                      Sorting the Performance Data.
                                                              (line   9)
* Visibility field <1>:                  Metric Definitions.  (line  13)



Tag Table:
Node: Top750
Node: Introduction3075
Node: Overview4352
Node: Main Features4993
Node: Sampling versus Tracing6733
Node: Steps Needed to Create a Profile9124
Node: A Mini Tutorial11311
Node: Getting Started12030
Node: The Example Program13680
Node: A First Profile14800
Node: The Source Code View19305
Node: The Disassembly View23801
Node: Display and Define the Metrics28490
Node: A First Customization of the Output30326
Node: Name the Experiment Directory32175
Node: Control the Number of Lines in the Output33272
Node: Sorting the Performance Data34031
Node: Scripting34973
Node: A More Elaborate Example35892
Node: The Call Tree38682
Node: More Information on the Experiment41423
Node: Control the Sampling Frequency44722
Node: Information on Load Objects47143
Node: Support for Multithreading50782
Node: Creating a Multithreading Experiment51422
Node: Commands Specific to Multithreading52935
Node: Viewing Multiple Experiments62341
Node: Aggregation of Experiments62918
Node: Comparison of Experiments65232
Node: Profile Hardware Event Counters69848
Node: Getting Information on the Counters Supported70556
Node: Examples Using Hardware Event Counters78156
Node: Java Profiling95853
Node: Terminology98222
Node: The Program Counter99234
Node: Inclusive and Exclusive Metrics99726
Node: Metric Definitions101179
Node: The Viewmode102364
Node: The Selection List103503
Node: Load Objects and Functions104905
Node: The Concept of a CPU in gprofng106919
Node: Hardware Event Counters Explained107680
Node: apath111472
Node: Other Document Formats112007
Node: Index114531

End Tag Table


Local Variables:
coding: utf-8
End: