This is gprofng.info, produced by makeinfo version 6.8 from gprofng.texi.

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* gprofng: (gprofng).           The next generation profiling tool for Linux
END-INFO-DIR-ENTRY


File: gprofng.info, Node: Top, Next: Introduction, Up: (dir)

GNU Gprofng
***********

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

* Menu:

* Introduction::            About this manual.
* Overview::                A brief overview of gprofng.
* A Mini Tutorial::         A short tutorial covering the key features.
* Terminology::             Various concepts and some terminology explained.
* Other Document Formats::  How to create this document in other formats.
* Index::                   The index.

 -- The Detailed Node Listing --

Introduction

Overview

* Main Features::                     A high level overview.
* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::  How to create a profile.

A Mini Tutorial

* Getting Started::                   The basics of profiling with gprofng.
* Support for Multithreading::        Commands specific to multithreaded applications.
* Viewing Multiple Experiments::      Analyze multiple experiments.
* Profile Hardware Event Counters::   How to use hardware event counters.
* Java Profiling::                    How to profile a Java application.

Terminology

* The Program Counter::                What is a Program Counter?
* Inclusive and Exclusive Metrics::    An explanation of inclusive and exclusive metrics.
* Metric Definitions::                 Definitions associated with metrics.
* The Viewmode::                       Select the way call stacks are presented.
* The Selection List::                 How to define a selection.
* Load Objects and Functions::         The components in an application.
* The Concept of a CPU in gprofng::    The definition of a CPU.
* Hardware Event Counters Explained::  What are event counters?
* apath::                              Our generic definition of a path.


File: gprofng.info, Node: Introduction, Next: Overview, Prev: Top, Up: Top

1 Introduction
**************

The gprofng tool is the next generation profiler for Linux.  It consists
of various commands to generate and display profile information.

   This manual starts with a tutorial on how to create and interpret a
profile.  This part is highly practical and aims to get users up to
speed as quickly as possible; we would like to show you how to get your
first profile on your screen as soon as possible.

   This is followed by more examples, covering many of the features.  At
the end of this tutorial, you should feel confident enough to tackle the
more complex tasks.

   In a future update a more formal reference manual will be included as
well.  Since even in this tutorial we use certain terminology, we have
included a chapter with descriptions at the end.  In case you encounter
unfamiliar wording or terminology, please check this chapter.

   One word of caution.  In several cases we had to tweak the screen
output somewhat in order to make it fit.  This is why the output may
look slightly different when you try things yourself.

   For now, we wish you a smooth profiling experience with gprofng and
good luck tackling performance bottlenecks.


File: gprofng.info, Node: Overview, Next: A Mini Tutorial, Prev: Introduction, Up: Top

2 A Brief Overview of gprofng
*****************************

* Menu:

* Main Features::                     A high level overview.
* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::  How to create a profile.

Before we cover this tool in quite some detail, we start with a brief
overview of what it is, and the main features.
Since we know that many of you would like to get started right away, we
already explain the basics of profiling with 'gprofng' in this first
chapter.


File: gprofng.info, Node: Main Features, Next: Sampling versus Tracing, Up: Overview

2.1 Main Features
=================

These are the main features of the gprofng tool:

   * Profiling is supported for an application written in C, C++, Java,
     or Scala.

   * Shared libraries are supported.  The information is presented at
     the instruction level.

   * The following multithreading programming models are supported:
     Pthreads, OpenMP, and Java threads.

   * This tool works with unmodified production level executables.
     There is no need to recompile the code, but if the '-g' option has
     been used when building the application, source line level
     information is available.

   * The focus is on support for code generated with the 'gcc' compiler,
     but there is some limited support for the 'icc' compiler as well.
     Future improvements and enhancements will focus on 'gcc' though.

   * Processors from Intel, AMD, and Arm are supported, but the level of
     support depends on the architectural details.  In particular,
     hardware event counters may not be supported.

   * Several views into the data are supported.  For example, there is a
     function overview showing where the time is spent, but source line,
     disassembly, call tree, and caller-callees overviews are available
     as well.

   * Through filters, the user can zoom in on an area of interest.

   * Two or more profiles can be aggregated, or used in a comparison.
     This comparison can be obtained at the function, source line, and
     disassembly level.

   * Through a scripting language, and customization of the metrics
     shown, the generation and creation of a profile can be fully
     automated and provide tailored output.


File: gprofng.info, Node: Sampling versus Tracing, Next: Steps Needed to Create a Profile, Prev: Main Features, Up: Overview

2.2 Sampling versus Tracing
===========================

A key difference with some other profiling tools is that the main data
collection command 'gprofng collect app' mostly uses Program Counter
(PC) sampling under the hood.

   With _sampling_, the executable is stopped at regular intervals.
Each time it is halted, key information is gathered and stored.
This includes the Program Counter that keeps track of where the
execution is.  Hence the name.

   Together with operational data, this information is stored in the
experiment directory and can be viewed in the second phase.

   For example, the PC information is used to derive where the program
was when it was halted.  Since the sampling interval is known, it is
relatively easy to derive how much time was spent in the various parts
of the program.

   The opposite technique is generally referred to as _tracing_.  With
tracing, the target is instrumented with specific calls that collect the
requested information.

   These are some of the pros and cons of PC sampling versus tracing:

   * Since there is no need to recompile, existing executables can be
     used and the profile measures the behaviour of exactly the same
     executable that is used in production runs.

     With tracing, one inherently profiles a different executable,
     because the calls to the instrumentation library may affect the
     compiler optimizations and run time behaviour.

   * With sampling, there are very few restrictions on what can be
     profiled and even without access to the source code, a basic
     profile can be made.

   * A downside of sampling is that, depending on the sampling
     frequency, small functions may be missed or not captured
     accurately.  Although this is rare, it may happen and is the
     reason why the user has control over the sampling rate.

   * While tracing produces precise information, sampling is statistical
     in nature.  As a result, small variations may occur across
     seemingly identical runs.  We have not observed more than a few
     percent deviation though, especially if the target job executed
     for a sufficiently long time.

   * With sampling, it is not possible to get an accurate count of how
     often functions are called.


File: gprofng.info, Node: Steps Needed to Create a Profile, Prev: Sampling versus Tracing, Up: Overview

2.3 Steps Needed to Create a Profile
====================================

Creating a profile takes two steps.  First the profile data needs to be
generated.  This is followed by a viewing step to create a report from
the information that has been gathered.

   Every gprofng command starts with 'gprofng', the name of the driver.
This is followed by a keyword to define the high level functionality.
Depending on this keyword, a third qualifier may be needed to further
narrow down the request.  This combination is then followed by options
that are specific to the functionality desired.

   The command to gather, or "collect", the performance data is called
'gprofng collect app'.  Aside from numerous options, this command takes
the name of the target executable as an input parameter.

   Upon completion of the run, the performance data can be found in the
newly created experiment directory.

   Unless explicitly specified otherwise, a default name for this
directory is chosen.  The name is 'test.<n>.er' where 'n' is the first
integer number not yet in use for such a name.

   For example, the first time 'gprofng collect app' is invoked, an
experiment directory with the name 'test.1.er' is created.

   Upon a subsequent invocation of 'gprofng collect app' in the same
directory, an experiment directory with the name 'test.2.er' will be
created, and so forth.

   Note that 'gprofng collect app' supports an option to explicitly name
the experiment directory.  Other than the restriction that the name of
this directory has to end with '.er', any valid directory name can be
used for this.

   Now that we have the performance data, the next step is to display
it.

   The most commonly used command to view the performance information is
'gprofng display text'.  This is a very extensive and customizable tool
that produces the information in ASCII format.

   Another option is to use 'gprofng display html'.  This tool generates
a directory with files in html format.  These can be viewed in a
browser, allowing for easy navigation through the profile data.
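
   To make the two-step workflow concrete, below is a minimal sketch of
a complete profiling session.  It assumes an executable called 'a.out'
in the current directory and relies on the default experiment name; the
program, its options, and the directory name will of course differ for
your application:

     $ gprofng collect app ./a.out
     $ gprofng display text -functions test.1.er

   The first command runs the program under control of the tool and
stores the performance data in 'test.1.er'.  The second command prints a
function level overview based on that data.  Both commands are covered
in detail in the next chapter.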

File: gprofng.info, Node: A Mini Tutorial, Next: Terminology, Prev: Overview, Up: Top

3 A Mini Tutorial
*****************

In this chapter we present and discuss the main functionality of
'gprofng'.  This will be a practical approach, using an example code to
generate profile data and show how to get various performance reports.

* Menu:

* Getting Started::                   The basics of profiling with gprofng.
* Support for Multithreading::        Commands specific to multithreaded applications.
* Viewing Multiple Experiments::      Analyze multiple experiments.
* Profile Hardware Event Counters::   How to use hardware event counters.
* Java Profiling::                    How to profile a Java application.

File: gprofng.info, Node: Getting Started, Next: Support for Multithreading, Up: A Mini Tutorial

3.1 Getting Started
===================

The information presented here provides a good and common basis for many
profiling tasks, but there are more features that you may want to
leverage.  These are covered in subsequent sections in this chapter.

* Menu:

* The Example Program::                        A description of the example program used.
* A First Profile::                            How to get the first profile.
* The Source Code View::                       Display the metrics in the source code.
* The Disassembly View::                       Display the metrics at the instruction level.
* Display and Define the Metrics::             An example of how to customize the metrics.
* A First Customization of the Output::        An example of how to customize the output.
* Name the Experiment Directory::              Change the name of the experiment directory.
* Control the Number of Lines in the Output::  Change the number of lines in the tables.
* Sorting the Performance Data::               How to set the metric to sort by.
* Scripting::                                  Use a script to execute the commands.
* A More Elaborate Example::                   An example of customization.
* The Call Tree::                              Display the dynamic call tree.
* More Information on the Experiment::         How to get additional statistics.
* Control the Sampling Frequency::             How to control the sampling granularity.
* Information on Load Objects::                How to get more information on load objects.


File: gprofng.info, Node: The Example Program, Next: A First Profile, Up: Getting Started

3.1.1 The Example Program
-------------------------

Throughout this guide we use the same example C code that implements the
multiplication of a vector of length n by an m by n matrix.  The result
is stored in a vector of length m.  The algorithm has been parallelized
using POSIX Threads, or Pthreads for short.

   The code was built using the 'gcc' compiler and the name of the
executable is mxv-pthreads.exe.

   The matrix sizes can be set through the '-m' and '-n' options.  The
number of threads is set with the '-t' option.  To increase the duration
of the run, the multiplication is executed repeatedly.
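
   The build command itself is not part of this tutorial.  For
reference, a build along the following lines can be assumed; the exact
list of source files is an assumption and only the use of the '-g'
option and the Pthreads library matters here:

     # Hypothetical build line; the source file names are an assumption.
     $ gcc -O2 -g -o mxv-pthreads.exe *.c -lpthread

   As explained later, the '-g' option preserves the information needed
to map the performance data back to individual source lines.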

   This is an example that multiplies a 3000 by 2000 matrix with a
vector of length 2000 using 2 threads:

     $ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
     mxv: error check passed - rows = 3000 columns = 2000 threads = 2
     $

   The program performs an internal check to verify the results are
correct.  The result of this check is printed, followed by the matrix
sizes and the number of threads used.


File: gprofng.info, Node: A First Profile, Next: The Source Code View, Prev: The Example Program, Up: Getting Started

3.1.2 A First Profile
---------------------

The first step is to collect the performance data.  It is important to
remember that much more information is gathered than may be shown by
default.  Often a single data collection run is sufficient to get a lot
of insight.

   The 'gprofng collect app' command is used for the data collection.
Nothing needs to be changed in the way the application is executed.  The
only difference is that it is now run under control of the tool, as
shown below:

     $ gprofng collect app ./mxv-pthreads.exe -m 3000 -n 2000 -t 1

   This command produces the following output:

     Creating experiment database test.1.er (Process ID: 2416504) ...
     mxv: error check passed - rows = 3000 columns = 2000 threads = 1

   We see the message that a directory with the name 'test.1.er' has
been created.  The application then completes as usual and we have our
first experiment directory that can be analyzed.

   The tool we use for this is called 'gprofng display text'.  It takes
the name of the experiment directory as an argument.

   If invoked this way, the tool starts in the interactive _interpreter_
mode.  While in this environment, commands can be given and the tool
responds.  This is illustrated below:

     $ gprofng display text test.1.er
     Warning: History and command editing is not supported on this system.
     (gp-display-text) quit
     $

   While useful in certain cases, we prefer to use this tool in command
line mode, by specifying the commands to be issued when invoking the
tool.  The way to do this is to prepend each command with a hyphen ('-')
when it is used on the command line.

   For example, with the 'functions' command we request a list of the
functions that have been executed and their respective CPU times:

     $ gprofng display text -functions test.1.er
     Functions sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     2.160     2.160      mxv_core
     0.047     0.103      init_data
     0.030     0.043      erand48_r
     0.013     0.013      __drand48_iterate
     0.013     0.056      drand48
     0.008     0.010      _int_malloc
     0.001     0.001      brk
     0.001     0.002      sysmalloc
     0.        0.001      __default_morecore
     0.        0.113      __libc_start_main
     0.        0.010      allocate_data
     0.        2.160      collector_root
     0.        2.160      driver_mxv
     0.        0.113      main
     0.        0.010      malloc
     0.        0.001      sbrk

   As easy and simple as these steps are, we do have a first profile of
our program!  There are three columns.  The first two contain the _Total
CPU Time_, which is the sum of the user and system time.  *Note
Inclusive and Exclusive Metrics:: for an explanation of "exclusive" and
"inclusive" times.

   The first line echoes the metric that is used to sort the output.  By
default, this is the exclusive CPU time, but the sort metric can be
changed by the user.

   The three columns thus show the exclusive CPU time, the inclusive CPU
time, and the name of the function.

   The function with the name '<Total>' is not a user function, but is
introduced by 'gprofng' and is used to display the accumulated metric
values.  In this case, we see that the total CPU time of this job was
'2.272' seconds.

   With '2.160' seconds, function 'mxv_core' is the most time consuming
function.  It is also a leaf function.

   The next function in the list is 'init_data'.  Although the CPU time
spent in this part is negligible, this is an interesting entry because
the inclusive CPU time of '0.103' seconds is higher than the exclusive
CPU time of '0.047' seconds.  Clearly it is calling another function, or
even more than one function.  *Note The Call Tree:: for details on how
to get more information on this.

   The function 'collector_root' does not look familiar.  It is one of
the internal functions used by 'gprofng collect app' and can be ignored.
While the inclusive time is high, the exclusive time is zero.  This
means it doesn't contribute to the performance.

   The question is how to find out where this function originates from.
There is a very useful command to get more details on a function.  *Note
Information on Load Objects::.


File: gprofng.info, Node: The Source Code View, Next: The Disassembly View, Prev: A First Profile, Up: Getting Started

3.1.3 The Source Code View
--------------------------

In general, you would like to focus the tuning efforts on the most time
consuming part(s) of the program.  In this case that is easy, since
2.160 seconds out of a total of 2.272 seconds is spent in function
'mxv_core'.  That is 95% of the total and it is time to dig deeper and
look at the time distribution at the source code level.

   The 'source' command is used to accomplish this.  It takes the name
of the function, not the source filename, as an argument.  This is
demonstrated below, where the 'gprofng display text' command is used to
show the annotated source listing of function 'mxv_core'.

   Please note that the source code has to be compiled with the '-g'
option in order for the source code feature to work.  Otherwise the
location cannot be determined.

     $ gprofng display text -source mxv_core test.1.er

   The slightly modified output is as follows:

     Source file: <apath>/mxv.c
     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

                             <lines deleted>
                                 <Function: mxv_core>
        0.        0.         32. void __attribute__ ((noinline))
                                 mxv_core (
                                 uint64_t row_index_start,
                                 uint64_t row_index_end,
                                 uint64_t m, uint64_t n,
                                 double **restrict A,
                                 double *restrict b,
                                 double *restrict c)
        0.        0.         33. {
        0.        0.         34.   for (uint64_t i=row_index_start;
                                        i<=row_index_end; i++) {
        0.        0.         35.     double row_sum = 0.0;
     ## 1.687     1.687      36.     for (int64_t j=0; j<n; j++)
        0.473     0.473      37.       row_sum += A[i][j]*b[j];
        0.        0.         38.     c[i] = row_sum;
                             39.   }
        0.        0.         40. }

   The first three lines provide information on the location of the
source file, the object file and the load object (*Note Load Objects and
Functions::).

   Function 'mxv_core' is part of a source file that has other functions
as well.  These functions will be shown, but without timing information.
They have been removed from the output shown above.

   This is followed by the annotated source code listing.  The selected
metrics are shown first, followed by a source line number, and the
source code.  The most time consuming line(s) are marked with the '##'
symbol.  In this way they are easier to find.

   What we see is that all of the time is spent in lines 36-37.

   A related command that sometimes comes in handy as well is called
'lines'.  It displays a list of the source lines and their metrics,
ordered according to the current sort metric (*Note Sorting the
Performance Data::).

   Below are the command and the output.  For layout reasons, only the
top 10 lines are shown here, and the last part of the text on some lines
has been replaced by dots.

     $ gprofng display text -lines test.1.er

     Lines sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     1.687     1.687      mxv_core, line 36 in "mxv.c"
     0.473     0.473      mxv_core, line 37 in "mxv.c"
     0.032     0.088      init_data, line 72 in "manage_data.c"
     0.030     0.043      <Function: erand48_r, instructions without line numbers>
     0.013     0.013      <Function: __drand48_iterate, instructions without ...>
     0.013     0.056      <Function: drand48, instructions without line numbers>
     0.012     0.012      init_data, line 77 in "manage_data.c"
     0.008     0.010      <Function: _int_malloc, instructions without ...>
     0.003     0.003      init_data, line 71 in "manage_data.c"

   What this overview immediately highlights is that the next most time
consuming source line takes 0.032 seconds only.  With an inclusive time
of 0.088 seconds, it is also clear that this branch of the code does not
impact the performance.


File: gprofng.info, Node: The Disassembly View, Next: Display and Define the Metrics, Prev: The Source Code View, Up: Getting Started

3.1.4 The Disassembly View
--------------------------

The source view is very useful to obtain more insight into where the
time is spent, but sometimes this is not sufficient.  This is when the
disassembly view comes in.  It is activated with the 'disasm' command
and, as with the source view, it displays an annotated listing.  In this
case it shows the instructions with the metrics, interleaved with the
source lines.  The instructions have a reference in square brackets ('['
and ']') to the source line they correspond to.

This is what we get for our example:

     $ gprofng display text -disasm mxv_core test.1.er

     Source file: <apath>/mxv.c
     Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

                             <lines deleted>
                             32. void __attribute__ ((noinline))
                                 mxv_core (
                                 uint64_t row_index_start,
                                 uint64_t row_index_end,
                                 uint64_t m, uint64_t n,
                                 double **restrict A,
                                 double *restrict b,
                                 double *restrict c)
                             33. {
                                 <Function: mxv_core>
        0.        0.             [33]   4021ba:  mov    0x8(%rsp),%r10
                             34.   for (uint64_t i=row_index_start;
                                        i<=row_index_end; i++) {
        0.        0.             [34]   4021bf:  cmp    %rsi,%rdi
        0.        0.             [34]   4021c2:  jbe    0x37
        0.        0.             [34]   4021c4:  ret
                             35.     double row_sum = 0.0;
                             36.     for (int64_t j=0; j<n; j++)
                             37.       row_sum += A[i][j]*b[j];
        0.        0.             [37]   4021c5:  mov    (%r8,%rdi,8),%rdx
        0.        0.             [36]   4021c9:  mov    $0x0,%eax
        0.        0.             [35]   4021ce:  pxor   %xmm1,%xmm1
        0.002     0.002          [37]   4021d2:  movsd  (%rdx,%rax,8),%xmm0
        0.096     0.096          [37]   4021d7:  mulsd  (%r9,%rax,8),%xmm0
        0.375     0.375          [37]   4021dd:  addsd  %xmm0,%xmm1
     ## 1.683     1.683          [36]   4021e1:  add    $0x1,%rax
        0.004     0.004          [36]   4021e5:  cmp    %rax,%rcx
        0.        0.             [36]   4021e8:  jne    0xffffffffffffffea
                             38.     c[i] = row_sum;
        0.        0.             [38]   4021ea:  movsd  %xmm1,(%r10,%rdi,8)
        0.        0.             [34]   4021f0:  add    $0x1,%rdi
        0.        0.             [34]   4021f4:  cmp    %rdi,%rsi
        0.        0.             [34]   4021f7:  jb     0xd
        0.        0.             [35]   4021f9:  pxor   %xmm1,%xmm1
        0.        0.             [36]   4021fd:  test   %rcx,%rcx
        0.        0.             [36]   402200:  jne    0xffffffffffffffc5
        0.        0.             [36]   402202:  jmp    0xffffffffffffffe8
                             39.   }
                             40. }
        0.        0.             [40]   402204:  ret

   For each instruction, the timing values are given and we can see
exactly which ones are the most expensive.  As with the source level
view, the most expensive instructions are marked with the '##' symbol.

   As illustrated below and similar to the 'lines' command, we can get
an overview of the instructions executed by using the 'pcs' command.

   Below are the command and the output, which again have been
restricted to 10 lines:

     $ gprofng display text -pcs test.1.er

     PCs sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     1.683     1.683      mxv_core + 0x00000027, line 36 in "mxv.c"
     0.375     0.375      mxv_core + 0x00000023, line 37 in "mxv.c"
     0.096     0.096      mxv_core + 0x0000001D, line 37 in "mxv.c"
     0.027     0.027      init_data + 0x000000BD, line 72 in "manage_data.c"
     0.012     0.012      init_data + 0x00000117, line 77 in "manage_data.c"
     0.008     0.008      _int_malloc + 0x00000A45
     0.007     0.007      erand48_r + 0x00000062
     0.006     0.006      drand48 + 0x00000000
     0.005     0.005      __drand48_iterate + 0x00000005


File: gprofng.info, Node: Display and Define the Metrics, Next: A First Customization of the Output, Prev: The Disassembly View, Up: Getting Started

3.1.5 Display and Define the Metrics
------------------------------------

The default metrics shown by 'gprofng display text' are useful, but
there is more recorded than displayed.  We can customize the values
shown by defining the metrics ourselves.

   There are two commands related to changing the metrics shown:
'metric_list' and 'metrics'.

   The first command shows the metrics in use, plus all the metrics that
have been stored as part of the experiment.  The second command may be
used to define the metric list.

   In our example we get the following values for the metrics:

     $ gprofng display text -metric_list test.1.er

     Current metrics: e.totalcpu:i.totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
     Available metrics:
        Exclusive Total CPU Time: e.%totalcpu
        Inclusive Total CPU Time: i.%totalcpu
                            Size: size
                      PC Address: address
                            Name: name

   This shows the metrics currently in use, the metric that is used to
sort the data, and all the metrics that have been recorded, but are not
necessarily shown.

   In this case, the default metrics are set to the exclusive and
inclusive total CPU times, plus the name of the function, or load
object.

   The 'metrics' command is used to define the metrics that need to be
displayed.

   For example, to display the exclusive total CPU time, both as a
number and a percentage, use the following metric definition:
'e.%totalcpu'.

   Since the metrics can be tailored for different views, there is a way
to reset them to the default.  This is done through the special keyword
'default'.
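
   As a small illustration, the sketch below first requests a custom
metric list and then restores the default; the metric string is only an
example and we assume the 'default' keyword is accepted in command line
mode in the same way:

     $ gprofng display text -metrics e.%totalcpu:name -functions test.1.er
     $ gprofng display text -metrics default -functions test.1.er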

File: gprofng.info, Node: A First Customization of the Output, Next: Name the Experiment Directory, Prev: Display and Define the Metrics, Up: Getting Started

3.1.6 A First Customization of the Output
-----------------------------------------

With the information just given, we can customize the function overview.
For the sake of the example, we would like to display the name of the
function first, followed by the exclusive CPU time, given as an absolute
number and a percentage.

   Note that the commands are parsed in order of appearance.  This is
why we need to define the metrics _before_ requesting the function
overview:

     $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er

     Current metrics: name:e.%totalcpu
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Functions sorted by metric: Exclusive Total CPU Time

     Name                Excl. Total
                         CPU
                          sec.      %
     <Total>             2.272 100.00
     mxv_core            2.160  95.04
     init_data           0.047   2.06
     erand48_r           0.030   1.32
     __drand48_iterate   0.013   0.57
     drand48             0.013   0.57
     _int_malloc         0.008   0.35
     brk                 0.001   0.04
     sysmalloc           0.001   0.04
     __default_morecore  0.      0.
     __libc_start_main   0.      0.
     allocate_data       0.      0.
     collector_root      0.      0.
     driver_mxv          0.      0.
     main                0.      0.
     malloc              0.      0.
     sbrk                0.      0.

   This was a first and simple example of how to customize the output.
Note that we did not rerun our profiling job and merely modified the
display settings.  Below we will show other and also more advanced
examples of customization.


File: gprofng.info, Node: Name the Experiment Directory, Next: Control the Number of Lines in the Output, Prev: A First Customization of the Output, Up: Getting Started

3.1.7 Name the Experiment Directory
-----------------------------------

When using 'gprofng collect app', the default names for experiments work
fine, but they are quite generic.  It is often more convenient to select
a more descriptive name.  For example, one that reflects the conditions
for the experiment conducted.

   For this, the mutually exclusive '-o' and '-O' options come in handy.
Both may be used to provide a name for the experiment directory, but the
behaviour of 'gprofng collect app' is different.

   With the '-o' option, an existing experiment directory is not
overwritten.  You either need to explicitly remove an existing directory
first, or use a name that is not in use yet.

   This is in contrast with the behaviour for the '-O' option.  Any
existing (experiment) directory with the same name is silently
overwritten.

   Be aware that the name of the experiment directory has to end with
'.er'.
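
   A short sketch of the difference between the two options is shown
below; the directory name is just an example:

     # With -o, an existing run1.er is not overwritten; remove it first
     # or pick a name that is not in use yet.
     $ gprofng collect app -o run1.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1

     # With -O, an existing run1.er is silently overwritten.
     $ gprofng collect app -O run1.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1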

File: gprofng.info, Node: Control the Number of Lines in the Output, Next: Sorting the Performance Data, Prev: Name the Experiment Directory, Up: Getting Started

3.1.8 Control the Number of Lines in the Output
-----------------------------------------------

The 'limit <n>' command can be used to control the number of lines
printed in various overviews, including the function view, but it also
takes effect for other display commands, like 'lines'.

   The argument '<n>' should be a positive integer and sets the number
of lines shown.  A value of zero resets the limit to the default.

   Be aware that the pseudo-function '<Total>' counts as a regular
function.  For example, 'limit 10' displays nine user level functions.


File: gprofng.info, Node: Sorting the Performance Data, Next: Scripting, Prev: Control the Number of Lines in the Output, Up: Getting Started

3.1.9 Sorting the Performance Data
----------------------------------

The 'sort <key>' command sets the key to be used when sorting the
performance data.

   The key is a valid metric definition, but the visibility field (*Note
Metric Definitions::) in the metric definition is ignored since this
does not affect the outcome of the sorting operation.  For example, if
we set the sort key to 'e.totalcpu', the values will be sorted in
descending order with respect to the exclusive total CPU time.

   The data can be sorted in reverse order by prepending the metric
definition with a minus ('-') sign.  For example, 'sort -e.totalcpu'.

   A default metric for the sort operation has been defined and since
this is a persistent command, this default can be restored with
'default' as the key.
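
   As a short illustration of the last two commands in command line
mode, the following sketch limits the output to 10 lines and sorts the
function view by the inclusive total CPU time; the limit and the sort
key are only examples:

     $ gprofng display text -limit 10 -sort i.totalcpu -functions test.1.er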

File: gprofng.info, Node: Scripting, Next: A More Elaborate Example, Prev: Sorting the Performance Data, Up: Getting Started

3.1.10 Scripting
----------------

As is probably clear by now, the list with commands for 'gprofng display
text' can be very long.  This is tedious and also error prone.  Luckily,
there is an easier and more elegant way to control the behaviour of this
tool.

   Through the 'script' command, the name of a file with commands can be
passed in.  These commands are parsed and executed as if they appeared
on the command line in the same order as encountered in the file.  The
commands in this script file can actually be mixed with commands on the
command line.

   The difference between the commands in the script file and those used
on the command line is that the latter require a leading dash ('-')
symbol.

   Comment lines are supported.  They need to start with the '#' symbol.
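
   For example, script commands and command line commands can be
combined in a single invocation; the script name below is hypothetical:

     $ gprofng display text -limit 5 -script my-commands test.1.er

   The 'limit' command is processed first, followed by the commands in
the file 'my-commands', in the order in which they appear in that file.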

File: gprofng.info, Node: A More Elaborate Example, Next: The Call Tree, Prev: Scripting, Up: Getting Started

3.1.11 A More Elaborate Example
-------------------------------

With the information presented so far, we can customize our data
gathering and display commands.

   As an example, to reflect the name of the algorithm and the number of
threads that were used in the experiment, we select 'mxv.1.thr.er' as
the name of the experiment directory.  All we then need to do is to add
the '-O' option followed by this name on the command line when running
'gprofng collect app':

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1

   The commands to generate the profile are put into a file that we
simply call 'my-script':

     $ cat my-script
     # This is my first gprofng script
     # Set the metrics
     metrics i.%totalcpu:e.%totalcpu:name
     # Use the exclusive time to sort
     sort e.totalcpu
     # Limit the function list to 5 lines
     limit 5
     # Show the function list
     functions

   This script file is then specified as input to the 'gprofng display
text' command that is used to display the performance information stored
in 'mxv.1.thr.er':

     $ gprofng display text -script my-script mxv.1.thr.er

   The command above produces the following output:

     # This is my first gprofng script
     # Set the metrics
     Current metrics: i.%totalcpu:e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Use the exclusive time to sort
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the function list to 5 lines
     Print limit set to 5
     # Show the function list
     Functions sorted by metric: Exclusive Total CPU Time

     Incl. Total   Excl. Total    Name
         CPU           CPU
      sec.      %   sec.      %
     2.272 100.00   2.272 100.00  <Total>
     2.159  95.00   2.159  95.00  mxv_core
     0.102   4.48   0.054   2.37  init_data
     0.035   1.54   0.025   1.10  erand48_r
     0.048   2.11   0.013   0.57  drand48

   In the first part of the output, our comment lines in the script file
are shown.  These are interleaved with an acknowledgement message for
the commands.

   This is followed by a profile consisting of 5 lines only.  For both
metrics, the percentages plus the timings are given.  The numbers are
sorted with respect to the exclusive total CPU time.

   It is now immediately clear that function 'mxv_core' is responsible
for 95% of the CPU time and 'init_data' takes only 4.5%.

   This is also where we see sampling in action.  Although this is
exactly the same job we profiled before, the timings are somewhat
different, but the differences are very small.

File: gprofng.info, Node: The Call Tree, Next: More Information on the Experiment, Prev: A More Elaborate Example, Up: Getting Started

3.1.12 The Call Tree
--------------------

The call tree shows the dynamic hierarchy of the application by
displaying the functions executed and their parent.  It helps to find
the most expensive path in the program.

   This feature is enabled through the 'calltree' command.  This is how
to get this tree for our current experiment:

     $ gprofng display text -calltree mxv.1.thr.er

   This displays the following structure:

     Functions Call Tree. Metric: Attributed Total CPU Time

     Attr.      Name
     Total
     CPU sec.
     2.272      +-<Total>
     2.159        +-collector_root
     2.159        |  +-driver_mxv
     2.159        |    +-mxv_core
     0.114        +-__libc_start_main
     0.114          +-main
     0.102            +-init_data
     0.048            |  +-drand48
     0.035            |  +-erand48_r
     0.010            |    +-__drand48_iterate
     0.011            +-allocate_data
     0.011            |  +-malloc
     0.011            |    +-_int_malloc
     0.001            |      +-sysmalloc
     0.001            +-check_results
     0.001              +-malloc
     0.001                +-_int_malloc

   At first sight this may not be what you expected and some explanation
is in order.

   First of all, function 'collector_root' is internal to 'gprofng' and
should be hidden from the user.  This is part of a planned future
enhancement.

   Recall that the 'objects' and 'fsingle' commands are very useful to
find out more about load objects in general, but also to help identify
an unknown entry in the function overview.  *Note Load Objects and
Functions::.

   Another thing to note is that there are two main branches.  The one
under 'collector_root' and the second one under '__libc_start_main'.
This reflects the fact that we are executing a parallel program.  Even
though we only used one thread for this run, this is still executed in a
separate path.

   The main, sequential part of the program is displayed under 'main'
and shows the functions called and the time they took.

   There are two things worth noting for the call tree feature:

   * This is a dynamic tree and since sampling is used, it most likely
     looks slightly different across seemingly identical profile runs.
     In case the run times are short, it is worth considering using a
     higher resolution through the '-p' option.  For example, use '-p
     hi' to increase the sampling rate.

   * In case hardware event counters have been enabled (*Note Profile
     Hardware Event Counters::), these values are also displayed in the
     call tree view.


File: gprofng.info,  Node: More Information on the Experiment,  Next: Control the Sampling Frequency,  Prev: The Call Tree,  Up: Getting Started

3.1.13 More Information on the Experiment
-----------------------------------------

The experiment directory does not only contain performance related
data.  Several system characteristics, the actual command executed, and
some global performance statistics can be displayed as well.

   The 'header' command displays information about the experiment(s).
For example, this is the command to extract this data from our
experiment directory:

     $ gprofng display text -header mxv.1.thr.er

   The above command prints the following information.  Note that some
of the lay-out and the information have been modified.  The textual
changes are marked with the '<' and '>' symbols.

     Experiment: mxv.1.thr.er
       No errors
       No warnings
       Archive command `gp-archive -n -a on
                        --outfile <exp_dir>/archive.log <exp_dir>'

       Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
       Process pid 30591, ppid 30589, pgrp 30551, sid 30468
       Current working directory: <cwd>
       Collector version: `2.36.50'; experiment version 12.4 (64-bit)
       Host `<hostname>', OS `Linux <version>', page size 4096,
            architecture `x86_64'
         16 CPUs, clock speed 1995 MHz.
         Memory: 30871514 pages @  4096 = 120591 MB.
       Data collection parameters:
         Clock-profiling, interval = 997 microsecs.
         Periodic sampling, 1 secs.
         Follow descendant processes from: fork|exec|combo

       Experiment started <date and time>

       Experiment Ended: 2.293162658
       Data Collection Duration: 2.293162658

   The output above may assist in troubleshooting, or to verify some of
the operational conditions, and we recommend including this command
when generating a profile.

   Related to this command there is a useful option to record your own
comment(s) in an experiment.  To this end, use the '-C' option on the
'gprofng collect app' tool to specify a comment string.  Up to ten
comment lines can be included.  These comments are displayed with the
'header' command on the 'gprofng display text' tool.
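
   As an illustration, and re-using the settings from before, the
command below stores such a comment when the experiment is created.
The text of the comment is of course only an example:

     $ gprofng collect app -O mxv.1.thr.er -C "baseline, single thread" ./$exe -m $m -n $n -t 1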

   The 'overview' command displays information on the experiment(s) and
also shows a summary of the values for the metric(s) used.  This is an
example of how to use it on our newly created experiment directory:

     $ gprofng display text -overview mxv.1.thr.er

     Experiment(s):

     Experiment     :mxv.1.thr.er
       Target       : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
       Host         : <hostname> (<ISA>, Linux <version>)
       Start Time   : <date and time>
       Duration     : 2.293 Seconds

     Metrics:

       Experiment Duration (Seconds): [2.293]
       Clock Profiling
         [X]Total CPU Time - totalcpu (Seconds): [*2.272]

     Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
            metrics.
            The metrics command can be used to change selections.  The
            metric_list command lists all available metrics.

   This command provides a dashboard overview that helps to easily
identify where the time is spent, and in case hardware event counters
are used, it shows their total values.


File: gprofng.info,  Node: Control the Sampling Frequency,  Next: Information on Load Objects,  Prev: More Information on the Experiment,  Up: Getting Started

3.1.14 Control the Sampling Frequency
-------------------------------------

So far we have not talked about the frequency of the sampling process,
but in some cases it is useful to change the default of 10
milliseconds.

   The advantage of increasing the sampling frequency is that functions
that do not take much time per invocation are more accurately captured.
The downside is that more data is gathered.  This has an impact on the
overhead of the collection process and more disk space is required.

   In general this is not an immediate concern, but with heavily
threaded applications that run for an extended period of time,
increasing the frequency may have a more noticeable impact.

   The '-p' option on the 'gprofng collect app' tool is used to enable
or disable clock based profiling, or to explicitly set the sampling
rate.  This option takes one of the following keywords:

'off'
     Disable clock based profiling.

'on'
     Enable clock based profiling with a per thread sampling interval
     of 10 ms.
     This is the default.

'lo'
     Enable clock based profiling with a per thread sampling interval
     of 100 ms.

'hi'
     Enable clock based profiling with a per thread sampling interval
     of 1 ms.

'<value>'
     Enable clock based profiling with a per thread sampling interval
     of <value>.

   One may wonder why there is an option to disable clock based
profiling.  This is because by default, it is enabled when conducting
hardware event counter experiments (*Note Profile Hardware Event
Counters::).  With the '-p off' option, this can be disabled.

   If an explicit value is set for the sampling interval, the number
can be an integer or a floating-point number.  A suffix of 'u' for
microseconds, or 'm' for milliseconds is supported.  If no suffix is
used, the value is assumed to be in milliseconds.

   If the value is smaller than the clock profiling minimum, a warning
message is issued and it is set to the minimum.  In case it is not a
multiple of the clock profiling resolution, it is silently rounded down
to the nearest multiple of the clock resolution.

   If the value exceeds the clock profiling maximum, is negative, or
zero, an error is reported.

   Note that the 'header' command echoes the sampling rate used.
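
   For example, the hypothetical command below selects a sampling
interval of 5 milliseconds; the name of the experiment directory is our
choice:

     $ gprofng collect app -p 5m -O mxv.5ms.er ./$exe -m $m -n $n -t 1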


File: gprofng.info,  Node: Information on Load Objects,  Prev: Control the Sampling Frequency,  Up: Getting Started

3.1.15 Information on Load Objects
----------------------------------

It may happen that the function list contains a function that is not
known to the user.  This can easily happen with library functions, for
example.  Luckily, there are three commands that come in handy then.

   These commands are 'objects', 'fsingle', and 'fsummary'.  They
provide details on load objects (*Note Load Objects and Functions::).

   The 'objects' command lists all load objects that have been
referenced during the performance experiment.  Below we show the
command and the result for our profile job.  Like before, the (long)
path names in the output have been shortened and replaced by the
'<apath>' symbol that represents an absolute directory path.

     $ gprofng display text -objects mxv.1.thr.er

   The output includes the name and path of the target executable:

     <Unknown> (<Unknown>)
     <mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
     <librt-2.17.so> (/usr/lib64/librt-2.17.so)
     <libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
     <libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
     <libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
     <libc-2.17.so> (/usr/lib64/libc-2.17.so)
     <libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
     <libm-2.17.so> (/usr/lib64/libm-2.17.so)
     <libgp-collector.so> (<apath>/libgp-collector.so)
     <ld-2.17.so> (/usr/lib64/ld-2.17.so)
     <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)

   The 'fsingle' command may be used to get more details on a specific
entry in the function view.  For example, the command below provides
additional information on the 'collector_root' function shown in the
function overview.

     $ gprofng display text -fsingle collector_root mxv.1.thr.er

   Below is the output from this command.  It has been somewhat
modified to match the display requirements.

     collector_root
             Exclusive Total CPU Time:    0.    (  0. %)
             Inclusive Total CPU Time:    2.159 ( 95.0%)
                                 Size:  401
                           PC Address: 10:0x0001db60
                          Source File: <apath>/dispatcher.c
                          Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
                          Load Object: <apath>/libgp-collector.so
                         Mangled Name:
                              Aliases:

   In this table we not only see how much time was spent in this
function, we also see where it originates from.  In addition to this,
the size and start address are given as well.  If the source code
location is known, it is also shown here.

   The related 'fsummary' command displays the same information as
'fsingle', but for all functions in the function overview, including
'<Total>':

     $ gprofng display text -fsummary mxv.1.thr.er

     Functions sorted by metric: Exclusive Total CPU Time

     <Total>
             Exclusive Total CPU Time:    2.272 (100.0%)
             Inclusive Total CPU Time:    2.272 (100.0%)
                                 Size:    0
                           PC Address: 1:0x00000000
                          Source File: (unknown)
                          Object File: (unknown)
                          Load Object: <Total>
                         Mangled Name:
                              Aliases:

     mxv_core
             Exclusive Total CPU Time:    2.159 ( 95.0%)
             Inclusive Total CPU Time:    2.159 ( 95.0%)
                                 Size:   75
                           PC Address: 2:0x000021ba
                          Source File: <apath>/mxv.c
                          Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
                          Load Object: <apath>/mxv-pthreads.exe
                         Mangled Name:
                              Aliases:

      ... etc ...


File: gprofng.info,  Node: Support for Multithreading,  Next: Viewing Multiple Experiments,  Prev: Getting Started,  Up: A Mini Tutorial

3.2 Support for Multithreading
==============================

In this chapter we introduce and discuss the support for
multithreading.  As is shown below, nothing needs to be changed when
collecting the performance data.

   The difference is that additional commands are available to get more
information on the parallel environment, and several filters allow the
user to zoom in on specific threads.

* Menu:

* Creating a Multithreading Experiment::
* Commands Specific to Multithreading::


File: gprofng.info,  Node: Creating a Multithreading Experiment,  Next: Commands Specific to Multithreading,  Up: Support for Multithreading

3.2.1 Creating a Multithreading Experiment
------------------------------------------

We demonstrate the support for multithreading using the same code and
settings as before, but this time we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2

   First of all, note that we did not change anything, other than
setting the number of threads to 2.
Nothing special is needed to profile a multithreaded job when using
'gprofng'.

   The same is true when displaying the performance results.  The same
commands that we used before work unmodified.  For example, this is all
that is needed to get a function overview:

     $ gprofng display text -limit 10 -functions mxv.2.thr.er

   This produces the following familiar looking output:

     Print limit set to 10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total   Incl. Total    Name
     CPU sec.      CPU sec.
     2.268         2.268          <Total>
     2.155         2.155          mxv_core
     0.044         0.103          init_data
     0.030         0.046          erand48_r
     0.016         0.016          __drand48_iterate
     0.013         0.059          drand48
     0.008         0.011          _int_malloc
     0.003         0.003          brk
     0.            0.003          __default_morecore
     0.            0.114          __libc_start_main


File: gprofng.info,  Node: Commands Specific to Multithreading,  Prev: Creating a Multithreading Experiment,  Up: Support for Multithreading

3.2.2 Commands Specific to Multithreading
-----------------------------------------

The function overview shown above shows the results aggregated over all
the threads.  The interesting new element is that we can also look at
the performance data for the individual threads.

   The 'thread_list' command displays how many threads have been used:

     $ gprofng display text -thread_list mxv.2.thr.er

   This produces the following output, showing that three threads have
been used:

     Exp Sel Total
     === === =====
       1 all     3

   The output confirms there is one experiment and that by default all
threads are selected.

   It may seem surprising to see three threads here, since we used the
'-t 2' option, but it is common for a Pthreads program to use one
additional thread.  This is typically the thread that runs from start
to finish and handles the sequential portions of the code, as well as
taking care of managing the other threads.

   It is no different in our example code.  At some point, the main
thread creates and activates the two threads that perform the
multiplication of the matrix with the vector.  Upon completion of this
computation, the main thread continues.

   The 'threads' command is simple, yet very powerful.
It shows the total value of the metrics for each thread.  To make it
easier to interpret the data, we modify the metrics to include
percentages:

     $ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er

   The command above produces the following overview:

     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Objects sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     2.258 100.00   <Total>
     1.075  47.59   Process 1, Thread 3
     1.070  47.37   Process 1, Thread 2
     0.114   5.03   Process 1, Thread 1

   The first line gives the total CPU time accumulated over the threads
selected.  This is followed by the metric value(s) for each thread.

   From this it is clear that the main thread is responsible for 5% of
the total CPU time, while the other two threads take 47% each.

   This view is ideally suited to verify if there are any load
balancing issues and also to find the most time consuming thread(s).

   While useful, often more information than this is needed.  This is
where the thread selection filter comes in.  Through the
'thread_select' command, one or more threads may be selected (*Note The
Selection List:: for how to define the selection list).

   Since it is most common to use this command in a script, we do so as
well here.  Below is the script we are using:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the function overview for thread 1
     thread_select 1
     functions
     # Get the function overview for thread 2
     thread_select 2
     functions
     # Get the function overview for thread 3
     thread_select 3
     functions

   The definition of the metrics and the output limiter has been shown
and explained before, so we will not discuss them again.  The new
command we focus on is 'thread_select'.

   This command takes a list (*Note The Selection List::) to select
specific threads.  In this case we simply use the individual thread
numbers that we obtained with the 'thread_list' command earlier.

   This restricts the output of the 'functions' command to the thread
number(s) specified.  This means that the script above shows which
function(s) each thread executes and how much CPU time they consumed.
Both the timings and their percentages are given.
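
   Assuming that the script above is stored in a file that we call
'my-script-threads' (the file name is our choice), this is how it is
executed:

     $ gprofng display text -script my-script-threads mxv.2.thr.er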

   This is the relevant part of the output for the first thread:

     # Get the function overview for thread 1
     Exp Sel Total
     === === =====
       1 1       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.114 100.00   <Total>
     0.051  44.74   init_data
     0.028  24.56   erand48_r
     0.017  14.91   __drand48_iterate
     0.010   8.77   _int_malloc
     0.008   7.02   drand48
     0.      0.     __libc_start_main
     0.      0.     allocate_data
     0.      0.     main
     0.      0.     malloc

   As usual, the comment lines are echoed.  This is followed by a
confirmation of our selection.  We see that indeed thread 1 has been
selected.  What is displayed next is the function overview for this
particular thread.  Due to the 'limit 10' command, there are ten
entries in this list.

   Below are the overviews for threads 2 and 3 respectively.  We see
that all of the CPU time is spent in function 'mxv_core' and that this
time is approximately the same for both threads.

     # Get the function overview for thread 2
     Exp Sel Total
     === === =====
       1 2       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     1.072 100.00   <Total>
     1.072 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for thread 3
     Exp Sel Total
     === === =====
       1 3       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     1.076 100.00   <Total>
     1.076 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   When analyzing the performance of a multithreaded application, it is
sometimes useful to know whether threads have mostly executed on the
same core, say, or if they have wandered across multiple cores.  This
sort of stickiness is usually referred to as _thread affinity_.

   Similar to the commands for the threads, there are several commands
related to the usage of the cores, or _CPUs_ as they are called in
'gprofng' (*Note The Concept of a CPU in gprofng::).

   In order to have some more interesting data to look at, we created a
new experiment, this time using 8 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8

   Similar to the 'thread_list' command, the 'cpu_list' command
displays how many CPUs have been used.  The equivalent of the 'threads'
command is the 'cpus' command, which shows the CPU numbers that were
used and how much time was spent on each of them.  Both are
demonstrated below.

     $ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er

   This command produces the following output:

     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Exp Sel Total
     === === =====
       1 all    10
     Objects sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     2.310 100.00   <Total>
     0.286  12.39   CPU 7
     0.284  12.30   CPU 13
     0.282  12.21   CPU 5
     0.280  12.13   CPU 14
     0.266  11.52   CPU 9
     0.265  11.48   CPU 2
     0.264  11.44   CPU 11
     0.194   8.42   CPU 0
     0.114   4.92   CPU 1
     0.074   3.19   CPU 15

   What we see in this table is that a total of 10 CPUs have been used.
This is followed by a list with all the CPU numbers that have been used
during the run.  For each CPU it is shown how much time was spent on
it.

   While the table with thread times shown earlier may point at a load
imbalance in the application, this overview has a different purpose.

   For example, we see that 10 CPUs have been used, but we know that
the application uses only 9 threads.  This means that at least one
thread has executed on more than one CPU.  In itself this is not
something to worry about, but it warrants a deeper investigation.

   Honesty dictates that we first performed a pre-analysis to find out
which thread(s) had been running on more than one CPU.  We found this
to be thread 7.  It executed on CPUs 0 and 15.

   With this knowledge, we wrote the script shown below.  It zooms in
on the behaviour of thread 7.

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     functions
     # Get the function overview for CPU 0
     cpu_select 0
     functions
     # Get the function overview for CPU 15
     cpu_select 15
     functions

   From the threads overview shown earlier, we know that thread 7 used
'0.268' seconds of CPU time.

   By selecting CPUs 0 and 15, respectively, we get the following
function overviews:

     # Get the function overview for CPU 0
     Exp Sel Total
     === === =====
       1 0      10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.194 100.00   <Total>
     0.194 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for CPU 15
     Exp Sel Total
     === === =====
       1 15     10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.074 100.00   <Total>
     0.074 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   This shows that thread 7 spent '0.194' seconds on CPU 0 and '0.074'
seconds on CPU 15.


File: gprofng.info,  Node: Viewing Multiple Experiments,  Next: Profile Hardware Event Counters,  Prev: Support for Multithreading,  Up: A Mini Tutorial

3.3 Viewing Multiple Experiments
================================

One thing we did not cover so far is that 'gprofng' fully supports the
analysis of multiple experiments.  The 'gprofng display text' tool
accepts a list of experiments.  The data can either be aggregated
across the experiments, or used in a comparison.

   The 'experiment_list' command, demonstrated below, shows which
experiments have been loaded.

* Menu:

* Aggregation of Experiments::
* Comparison of Experiments::


File: gprofng.info,  Node: Aggregation of Experiments,  Next: Comparison of Experiments,  Up: Viewing Multiple Experiments

3.3.1 Aggregation of Experiments
--------------------------------

By default, the data for multiple experiments is aggregated and the
display commands show these combined results.
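
   As a minimal illustration, and re-using the experiments created
earlier, simply passing both experiment directories to the 'functions'
command prints the combined function overview:

     $ gprofng display text -limit 5 -functions mxv.1.thr.er mxv.2.thr.er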

   For example, we can aggregate the data for our single and dual
thread experiments.  Below is the script we used for this:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the list with experiments
     experiment_list
     # Get the function overview
     functions

   With the exception of the 'experiment_list' command, all commands
used have been discussed earlier.

   The 'experiment_list' command provides a list of the experiments
that have been loaded.  This is used to verify that we are looking at
the experiments we intend to aggregate.

   Assuming that this script is stored in a file called
'my-script-agg', this is how we use it:

     $ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er

   With the command above, we get the following output:

     # Define the metrics
     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the output to 10 lines
     Print limit set to 10
     # Get the list with experiments
     ID Sel   PID Experiment
     == === ===== ============
      1 yes 30591 mxv.1.thr.er
      2 yes 11629 mxv.2.thr.er
     # Get the function overview
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     4.533 100.00   <Total>
     4.306  94.99   mxv_core
     0.105   2.31   init_data
     0.053   1.17   erand48_r
     0.027   0.59   __drand48_iterate
     0.021   0.46   _int_malloc
     0.021   0.46   drand48
     0.001   0.02   sysmalloc
     0.      0.     __libc_start_main
     0.      0.     allocate_data

   The first five lines should look familiar.  The five lines that
follow echo the comment line in the script and show the overview of the
experiments.  This confirms that two experiments have been loaded and
that both are active.

   This is followed by the function overview.  The timings have been
summed up and the percentages are adjusted accordingly.  For example,
the total accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.


File: gprofng.info,  Node: Comparison of Experiments,  Prev: Aggregation of Experiments,  Up: Viewing Multiple Experiments

3.3.2 Comparison of Experiments
-------------------------------

The support for multiple experiments really shines in comparison mode.
This feature is enabled through the command 'compare on' and is
disabled by setting 'compare off'.

   In comparison mode, the data for the various experiments is shown
side by side, as illustrated below where we compare the results for the
multithreaded experiments using one and two threads respectively:

     $ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er

This produces the following output:

     Functions sorted by metric: Exclusive Total CPU Time

     mxv.1.thr.er  mxv.2.thr.er  mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total   Incl. Total   Incl. Total    Name
     CPU           CPU           CPU           CPU
      sec.          sec.          sec.          sec.
     2.272         2.261         2.272         2.261          <Total>
     2.159         2.148         2.159         2.148          mxv_core
     0.054         0.051         0.102         0.104          init_data
     0.025         0.028         0.035         0.045          erand48_r
     0.013         0.008         0.048         0.053          drand48
     0.011         0.010         0.012         0.010          _int_malloc
     0.010         0.017         0.010         0.017          __drand48_iterate
     0.001         0.            0.001         0.             sysmalloc
     0.            0.            0.114         0.114          __libc_start_main
     0.            0.            0.011         0.010          allocate_data
     0.            0.            0.001         0.             check_results
     0.            0.            2.159         2.148          collector_root
     0.            0.            2.159         2.148          driver_mxv
     0.            0.            0.114         0.114          main
     0.            0.            0.012         0.010          malloc

   This table is already helpful to more easily compare (two) profiles,
but there is more that we can do here.

   By default, in comparison mode, all measured values are shown.
Often profiling is about comparing performance data.  It is therefore
more useful to look at differences, or ratios, using one experiment as
a reference.

   The values shown are then relative to this reference.  For example,
if a ratio is below one, it means the reference value was higher.

   This feature is supported on the 'compare' command.  In addition to
'on', or 'off', this command also supports 'delta', or 'ratio'.

   Usage of one of these two keywords enables the comparison feature
and shows either the difference, or the ratio, relative to the
reference data.

   In the example below, we use the same two experiments used in the
comparison above, but as before, the number of lines is restricted to
10 and we focus on the exclusive timings plus percentages.  For the
comparison part we are interested in the differences.

   This is the script that produces such an overview:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Set the comparison mode to differences
     compare delta
     # Get the function overview
     functions

   Assuming this script file is called 'my-script-comp', this is how we
get the table displayed on our screen:

     $ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er

   Leaving out some of the lines printed, which we have seen before, we
get the following table:

     mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total     Name
     CPU           CPU
      sec.      %   delta      %
     2.272 100.00  -0.011 100.00   <Total>
     2.159  95.00  -0.011  94.97   mxv_core
     0.054   2.37  -0.003   2.25   init_data
     0.025   1.10  +0.003   1.23   erand48_r
     0.013   0.57  -0.005   0.35   drand48
     0.011   0.48  -0.001   0.44   _int_malloc
     0.010   0.44  +0.007   0.75   __drand48_iterate
     0.001   0.04  -0.001   0.     sysmalloc
     0.      0.    +0.      0.     __libc_start_main
     0.      0.    +0.      0.     allocate_data

   It is now easy to see that the CPU times for the most time consuming
functions in this code are practically the same.

   While in this case we used the delta as a comparison, the ratio can
be selected in exactly the same way.

   Note that the comparison feature is supported at the function,
source, and disassembly level.  There is no practical limit on the
number of experiments that can be used in a comparison.
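
   For example, and assuming the same two experiments, this is a sketch
of how to request the ratios directly from the command line instead of
through a script:

     $ gprofng display text -compare ratio -functions mxv.1.thr.er mxv.2.thr.er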


File: gprofng.info,  Node: Profile Hardware Event Counters,  Next: Java Profiling,  Prev: Viewing Multiple Experiments,  Up: A Mini Tutorial

3.4 Profile Hardware Event Counters
===================================

Many processors provide a set of hardware event counters and 'gprofng'
provides support for this feature.  *Note Hardware Event Counters
Explained:: for those readers that are not familiar with such counters
and would like to learn more.

   In this section we explain how to get the details on the event
counter support for the processor used in the experiment(s), and show
several examples.

* Menu:

* Getting Information on the Counters Supported::
* Examples Using Hardware Event Counters::


File: gprofng.info,  Node: Getting Information on the Counters Supported,  Next: Examples Using Hardware Event Counters,  Up: Profile Hardware Event Counters

3.4.1 Getting Information on the Counters Supported
---------------------------------------------------

The first step is to check if the processor used for the experiments is
supported by 'gprofng'.

   The '-h' option on 'gprofng collect app' will show the event counter
information:

     $ gprofng collect app -h

   In case the counters are supported, a list with the events is
printed.  Otherwise, a warning message will be issued.

   For example, below we show this command and the output on an Intel
Xeon Platinum 8167M (aka "Skylake") processor.  The output has been
split into several sections and each section is commented upon
separately.

     Run "gprofng collect app --help" for a usage message.

     Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
     (cpuver=2499):

       -h {auto|lo|on|hi}
          turn on default set of HW counters at the specified rate
       -h <ctr_def> [-h <ctr_def>]...
       -h <ctr_def>[,<ctr_def>]...
          specify HW counter profiling for up to 4 HW counters

   The first line shows how to get a usage overview.  This is followed
by some information on the target processor.

   The next five lines explain in what ways the '-h' option can be used
to define the events to be monitored.

   The first version shown above enables a default set of counters.
This default depends on the processor this command is executed on.  The
keyword following the '-h' option defines the sampling rate:

'auto'
     Match the sample rate used by clock profiling.  If the latter is
     disabled, use a per thread sampling rate of approximately 100
     samples per second.  This setting is the default and preferred.

'on'
     Use a per thread sampling rate of approximately 100 samples per
     second.

'lo'
     Use a per thread sampling rate of approximately 10 samples per
     second.

'hi'
     Use a per thread sampling rate of approximately 1000 samples per
     second.
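
   As a hypothetical illustration of this first variant, the command
below enables the default set of counters at the high sampling rate;
the experiment name and program settings are our choice:

     $ gprofng collect app -O mxv.hwc.hi.er -h hi ./$exe -m $m -n $n -t 1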

   The second and third variant define the events to be monitored.
Note that the number of simultaneous events supported is printed.  In
this case we can monitor four events in a single profiling job.

   It is a matter of preference whether you like to use the '-h' option
for each event, or use it once, followed by a comma separated list.

   There is one slight catch though.  The counter definition below has
a mandatory comma (',') between the event and the rate.  While a
default can be used for the rate, the comma cannot be omitted.  This
may result in a somewhat awkward counter definition in case the default
sampling rate is used.

   For example, the following two commands are equivalent.  Note the
double comma in the second command.  This is not a typo.

     $ gprofng collect app -h cycles -h insts ...
     $ gprofng collect app -h cycles,,insts ...

   In the first command this comma is not needed, because a comma (',')
immediately followed by white space may be omitted.

   This is why we prefer this syntax and in the remainder we will use
the first version of this command.

   The counter definition takes an event name, plus optionally one or
more attributes, followed by a comma, and optionally the sampling rate.
The output section below shows the formal definition.

     <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]

   The printed help then explains this syntax.  Below we have
summarized and expanded this output:

'<ctr>'
     The counter name must be selected from the available counters
     listed as part of the output printed with the '-h' option.  On
     most systems, if a counter is not listed, it may still be
     specified by its numeric value.

'~<attr>=<val>'
     This is an optional attribute that depends on the processor.  The
     list of supported attributes is printed in the output.  Examples
     of attributes are "user", or "system".  The value can be given in
     decimal or hexadecimal format.  Multiple attributes may be
     specified, and each must be preceded by a ~.

'<rate>'

     The sampling rate is one of the following:

     'auto'
          This is the default and matches the rate used by clock
          profiling.  If clock profiling is disabled, use 'on'.

     'on'
          Set the per thread maximum sampling rate to ~100
          samples/second

     'lo'
          Set the per thread maximum sampling rate to ~10
          samples/second

     'hi'
          Set the per thread maximum sampling rate to ~1000
          samples/second

     '<interval>'
          Define the sampling interval.  *Note Control the Sampling
          Frequency:: for how to define this.

   After the section with the formal definition of events and counters,
a processor specific list is displayed.  This part starts with an
overview of the default set of counters and the aliased names supported
_on this specific processor_.

     Default set of HW counters:

         -h cycles,,insts,,llm

     Aliases for most useful HW counters:

      alias    raw name                      type   units regs description

      cycles   unhalted-core-cycles   CPU-cycles    0123 CPU Cycles
      insts    instruction-retired        events    0123 Instructions Executed
      llm      llc-misses                 events    0123 Last-Level Cache Misses
      br_msp   branch-misses-retired      events    0123 Branch Mispredict
      br_ins   branch-instruction-retired events    0123 Branch Instructions

   The definitions given above may or may not be available on other
processors, but we try to maximize the overlap across alias sets.

   The table above shows the default set of counters defined for this
processor, and the aliases.  For each alias the full "raw" name is
given, plus the unit of the number returned by the counter (CPU cycles,
or a raw count), the hardware counter register(s) the event is allowed
to be mapped onto, and a short description.
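
   With the alias names from this table and the comma rule shown
earlier, the default set listed above can thus also be written as
follows:

     -h cycles -h insts -h llm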

   The last part of the output contains all the events that can be
monitored:

     Raw HW counters:

      name                                    type   units regs description

      unhalted-core-cycles               CPU-cycles    0123
      unhalted-reference-cycles              events    0123
      instruction-retired                    events    0123
      llc-reference                          events    0123
      llc-misses                             events    0123
      branch-instruction-retired             events    0123
      branch-misses-retired                  events    0123
      ld_blocks.store_forward                events    0123
      ld_blocks.no_sr                        events    0123
      ld_blocks_partial.address_alias        events    0123
      dtlb_load_misses.miss_causes_a_walk    events    0123
      dtlb_load_misses.walk_completed_4k     events    0123

      <many lines deleted>

      l2_lines_out.silent                    events    0123
      l2_lines_out.non_silent                events    0123
      l2_lines_out.useless_hwpf              events    0123
      sq_misc.split_lock                     events    0123

      See Chapter 19 of the "Intel 64 and IA-32 Architectures Software
      Developer's Manual Volume 3B: System Programming Guide"

   As can be seen, these names are not always easy to correlate to a
specific event of interest.  The processor manual should provide more
clarity on this.
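
   A raw name from this list is used in the same way as an alias.  As a
hypothetical example, and assuming the event is available on the target
processor, the definition below selects one of the DTLB related events
at the high sampling rate:

     $ gprofng collect app -h dtlb_load_misses.miss_causes_a_walk,hi ...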


File: gprofng.info,  Node: Examples Using Hardware Event Counters,  Prev: Getting Information on the Counters Supported,  Up: Profile Hardware Event Counters

3.4.2 Examples Using Hardware Event Counters
--------------------------------------------

The previous section may give the impression that these counters are
hard to use, but as we will show now, in practice it is quite simple.

   With the information from the '-h' option, we can easily set up our
first event counter experiment.

   We start by using the default set of counters defined for our
processor and we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2

   The new option here is '-h auto'.  The 'auto' keyword enables
hardware event counter profiling and selects the default set of
counters defined for this processor.

   As before, we can display the information, but there is one
practical hurdle to take.  Unless we like to view all metrics recorded,
we would need to know the names of the events that have been enabled.
This is tedious and also not portable in case we would like to repeat
this experiment on another processor.

   This is where the special 'hwc' metric comes in very handy.  It
automatically expands to the active set of events used.

   With this, it is very easy to display the event counter values.
Note that although the regular clock based profiling was enabled, we
only want to see the counter values.  We also request to see the
percentages and limit the output to the first 5 lines:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp

     Current metrics: e.%cycles:e+%insts:e+%llm:name
     Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
     Print limit set to 5
     Functions sorted by metric: Exclusive CPU Cycles

     Excl. CPU     Excl. Instructions   Excl. Last-Level     Name
     Cycles        Executed             Cache Misses
      sec.      %                  %                  %
     2.691 100.00  7906475309 100.00    122658983 100.00    <Total>
     2.598  96.54  7432724378  94.01    121745696  99.26    mxv_core
     0.035   1.31   188860269   2.39        70084   0.06    erand48_r
     0.026   0.95    73623396   0.93       763116   0.62    init_data
     0.018   0.66    76824434   0.97        40040   0.03    drand48

   As we have seen before, the first few lines echo the settings.  This
includes a list with the hardware event counters used by default.

   The table that follows makes it very easy to get an overview where
the time is spent and how many of the target events have occurred.

   As before, we can drill down deeper and see the same metrics at the
source line and instruction level.  Other than using 'hwc' in the
metrics definitions, nothing has changed compared to the previous
examples:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.hwc -source mxv_core $exp

   This is the relevant part of the output.  Since the lines get very
long, we have somewhat modified the lay-out:

        Excl. CPU    Excl.         Excl.
        Cycles       Instructions  Last-Level
         sec.        Executed      Cache Misses
                                                <Function: mxv_core>
        0.           0             0              32. void __attribute__ ((noinline))
                                                      mxv_core(...)
        0.           0             0              33. {
        0.           0             0              34. for (uint64_t i=...) {
        0.           0             0              35.    double row_sum = 0.0;
     ## 1.872        7291879319    88150571       36.    for (int64_t j=0; j<n; j++)
        0.725        140845059     33595125       37.      row_sum += A[i][j]*b[j];
        0.           0             0              38.    c[i] = row_sum;
                                                  39. }
        0.           0             0              40. }
}
2059*c42dbd0eSchristos
2060*c42dbd0eSchristos In a similar way we can display the event counter values at the
2061*c42dbd0eSchristosinstruction level. Again we have modified the lay-out due to page width
2062*c42dbd0eSchristoslimitations:
2063*c42dbd0eSchristos
2064*c42dbd0eSchristos $ exp=mxv.hwc.def.2.thr.er
2065*c42dbd0eSchristos $ gprofng display text -metrics e.hwc -disasm mxv_core $exp
2066*c42dbd0eSchristos
2067*c42dbd0eSchristos Excl. CPU Excl. Excl.
2068*c42dbd0eSchristos Cycles Instructions Last-Level
2069*c42dbd0eSchristos sec. Executed Cache Misses
2070*c42dbd0eSchristos <Function: mxv_core>
2071*c42dbd0eSchristos 0. 0 0 [33] 4021ba: mov 0x8(%rsp),%r10
2072*c42dbd0eSchristos 34. for (uint64_t i=...) {
2073*c42dbd0eSchristos 0. 0 0 [34] 4021bf: cmp %rsi,%rdi
2074*c42dbd0eSchristos 0. 0 0 [34] 4021c2: jbe 0x37
2075*c42dbd0eSchristos 0. 0 0 [34] 4021c4: ret
2076*c42dbd0eSchristos 35. double row_sum = 0.0;
2077*c42dbd0eSchristos 36. for (int64_t j=0; j<n; j++)
2078*c42dbd0eSchristos 37. row_sum += A[i][j]*b[j];
2079*c42dbd0eSchristos 0. 0 0 [37] 4021c5: mov (%r8,%rdi,8),%rdx
2080*c42dbd0eSchristos 0. 0 0 [36] 4021c9: mov $0x0,%eax
2081*c42dbd0eSchristos 0. 0 0 [35] 4021ce: pxor %xmm1,%xmm1
2082*c42dbd0eSchristos 0.002 12804230 321394 [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
2083*c42dbd0eSchristos 0.141 60819025 3866677 [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
2084*c42dbd0eSchristos 0.582 67221804 29407054 [37] 4021dd: addsd %xmm0,%xmm1
2085*c42dbd0eSchristos ## 1.871 7279075109 87989870 [36] 4021e1: add $0x1,%rax
2086*c42dbd0eSchristos 0.002 12804210 80351 [36] 4021e5: cmp %rax,%rcx
2087*c42dbd0eSchristos 0. 0 0 [36] 4021e8: jne 0xffffffffffffffea
2088*c42dbd0eSchristos 38. c[i] = row_sum;
2089*c42dbd0eSchristos 0. 0 0 [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
2090*c42dbd0eSchristos 0. 0 0 [34] 4021f0: add $0x1,%rdi
2091*c42dbd0eSchristos 0. 0 0 [34] 4021f4: cmp %rdi,%rsi
2092*c42dbd0eSchristos 0. 0 0 [34] 4021f7: jb 0xd
2093*c42dbd0eSchristos 0. 0 0 [35] 4021f9: pxor %xmm1,%xmm1
2094*c42dbd0eSchristos 0. 0 0 [36] 4021fd: test %rcx,%rcx
2095*c42dbd0eSchristos 0. 0 80350 [36] 402200: jne 0xffffffffffffffc5
2096*c42dbd0eSchristos 0. 0 0 [36] 402202: jmp 0xffffffffffffffe8
2097*c42dbd0eSchristos 39. }
2098*c42dbd0eSchristos 40. }
2099*c42dbd0eSchristos 0. 0 0 [40] 402204: ret
2100*c42dbd0eSchristos
2101*c42dbd0eSchristos So far we have used the default settings for the event counters. It
2102*c42dbd0eSchristosis quite straightforward to select specific counters. For the sake of
2103*c42dbd0eSchristosthe example, let's assume we would like to count how many branch
2104*c42dbd0eSchristosinstructions have been executed, as well as how many retired memory
2105*c42dbd0eSchristosload instructions missed in the L1 cache. We also want to count these
2106*c42dbd0eSchristosevents with a high resolution.
2107*c42dbd0eSchristos
2108*c42dbd0eSchristos This is the command to do so:
2109*c42dbd0eSchristos
2110*c42dbd0eSchristos $ exe=mxv-pthreads.exe
2111*c42dbd0eSchristos $ m=3000
2112*c42dbd0eSchristos $ n=2000
2113*c42dbd0eSchristos $ exp=mxv.hwc.sel.2.thr.er
2114*c42dbd0eSchristos $ hwc1=br_ins,hi
2115*c42dbd0eSchristos $ hwc2=mem_load_retired.l1_miss,hi
2116*c42dbd0eSchristos $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
2117*c42dbd0eSchristos
2118*c42dbd0eSchristos As before, we get a table with the event counts. Due to the very
2119*c42dbd0eSchristoslong name for the second counter, we have somewhat modified the output.
2120*c42dbd0eSchristos 2121*c42dbd0eSchristos $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er 2122*c42dbd0eSchristos 2123*c42dbd0eSchristos Functions sorted by metric: Exclusive Total CPU Time 2124*c42dbd0eSchristos Excl. Incl. Excl. Branch Excl. Name 2125*c42dbd0eSchristos Total Total Instructions mem_load_retired.l1_miss 2126*c42dbd0eSchristos CPU sec. CPU sec. Events 2127*c42dbd0eSchristos 2.597 2.597 1305305319 4021340 <Total> 2128*c42dbd0eSchristos 2.481 2.481 1233233242 3982327 mxv_core 2129*c42dbd0eSchristos 0.040 0.107 19019012 9003 init_data 2130*c42dbd0eSchristos 0.028 0.052 23023048 15006 erand48_r 2131*c42dbd0eSchristos 0.024 0.024 19019008 9004 __drand48_iterate 2132*c42dbd0eSchristos 0.015 0.067 11011009 2998 drand48 2133*c42dbd0eSchristos 0.008 0.010 0 3002 _int_malloc 2134*c42dbd0eSchristos 0.001 0.001 0 0 brk 2135*c42dbd0eSchristos 0.001 0.002 0 0 sysmalloc 2136*c42dbd0eSchristos 0. 0.001 0 0 __default_morecore 2137*c42dbd0eSchristos 2138*c42dbd0eSchristos When using event counters, the values could be very large and it is 2139*c42dbd0eSchristosnot easy to compare the numbers. As we will show next, the 'ratio' 2140*c42dbd0eSchristosfeature is very useful when comparing such profiles. 2141*c42dbd0eSchristos 2142*c42dbd0eSchristos To demonstrate this, we have set up another event counter experiment 2143*c42dbd0eSchristoswhere we would like to compare the number of last level cache miss and 2144*c42dbd0eSchristosthe number of branch instructions executed when using a single thread, 2145*c42dbd0eSchristosor two threads. 2146*c42dbd0eSchristos 2147*c42dbd0eSchristos These are the commands used to generate the experiment directories: 2148*c42dbd0eSchristos 2149*c42dbd0eSchristos $ exe=./mxv-pthreads.exe 2150*c42dbd0eSchristos $ m=3000 2151*c42dbd0eSchristos $ n=2000 2152*c42dbd0eSchristos $ exp1=mxv.hwc.comp.1.thr.er 2153*c42dbd0eSchristos $ exp2=mxv.hwc.comp.2.thr.er 2154*c42dbd0eSchristos $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1 2155*c42dbd0eSchristos $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2 2156*c42dbd0eSchristos 2157*c42dbd0eSchristos The following script has been used to get the tables. Due to lay-out 2158*c42dbd0eSchristosrestrictions, we have to create two tables, one for each counter. 2159*c42dbd0eSchristos 2160*c42dbd0eSchristos # Limit the output to 5 lines 2161*c42dbd0eSchristos limit 5 2162*c42dbd0eSchristos # Define the metrics 2163*c42dbd0eSchristos metrics name:e.llm 2164*c42dbd0eSchristos # Set the comparison to ratio 2165*c42dbd0eSchristos compare ratio 2166*c42dbd0eSchristos functions 2167*c42dbd0eSchristos # Define the metrics 2168*c42dbd0eSchristos metrics name:e.br_ins 2169*c42dbd0eSchristos # Set the comparison to ratio 2170*c42dbd0eSchristos compare ratio 2171*c42dbd0eSchristos functions 2172*c42dbd0eSchristos 2173*c42dbd0eSchristos Note that we print the name of the function first, followed by the 2174*c42dbd0eSchristoscounter data. The new element is that we set the comparison mode to 2175*c42dbd0eSchristos'ratio'. This divides the data in a column by its counterpart in the 2176*c42dbd0eSchristosreference experiment. 
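   A brief aside before running it: the comparison is not limited to
ratios.  Assuming the 'delta' keyword behaves as described in the
section on comparing experiments, the first half of the script could be
changed along the following lines to show absolute differences instead
of ratios:

     # Limit the output to 5 lines
     limit 5
     # Define the metrics
     metrics name:e.llm
     # Set the comparison to absolute differences
     compare delta
     functions

   The examples that follow continue to use the 'ratio' version of the
script.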
2177*c42dbd0eSchristos
2178*c42dbd0eSchristos This is the command using this script and the two experiment
2179*c42dbd0eSchristosdirectories as input:
2180*c42dbd0eSchristos
2181*c42dbd0eSchristos $ gprofng display text -script my-script-comp-counters \
2182*c42dbd0eSchristos mxv.hwc.comp.1.thr.er \
2183*c42dbd0eSchristos mxv.hwc.comp.2.thr.er
2184*c42dbd0eSchristos
2185*c42dbd0eSchristos By design, we get two tables, one for each counter:
2186*c42dbd0eSchristos
2187*c42dbd0eSchristos Functions sorted by metric: Exclusive Last-Level Cache Misses
2188*c42dbd0eSchristos
2189*c42dbd0eSchristos mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
2190*c42dbd0eSchristos Name Excl. Last-Level Excl. Last-Level
2191*c42dbd0eSchristos Cache Misses Cache Misses
2192*c42dbd0eSchristos ratio
2193*c42dbd0eSchristos <Total> 122709276 x 0.788
2194*c42dbd0eSchristos mxv_core 121796001 x 0.787
2195*c42dbd0eSchristos init_data 723064 x 1.055
2196*c42dbd0eSchristos erand48_r 100111 x 0.500
2197*c42dbd0eSchristos drand48 60065 x 1.167
2198*c42dbd0eSchristos
2199*c42dbd0eSchristos Functions sorted by metric: Exclusive Branch Instructions
2200*c42dbd0eSchristos
2201*c42dbd0eSchristos mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
2202*c42dbd0eSchristos Name Excl. Branch Excl. Branch
2203*c42dbd0eSchristos Instructions Instructions
2204*c42dbd0eSchristos ratio
2205*c42dbd0eSchristos <Total> 1307307316 x 0.997
2206*c42dbd0eSchristos mxv_core 1235235239 x 0.997
2207*c42dbd0eSchristos erand48_r 23023033 x 0.957
2208*c42dbd0eSchristos drand48 20020009 x 0.600
2209*c42dbd0eSchristos __drand48_iterate 17017028 x 0.882
2210*c42dbd0eSchristos
2211*c42dbd0eSchristos A ratio less than one in the second column means that this counter
2212*c42dbd0eSchristosvalue was smaller than the value from the reference experiment shown in
2213*c42dbd0eSchristosthe first column.
2214*c42dbd0eSchristos
2215*c42dbd0eSchristos This kind of presentation of the results makes it much easier to
2216*c42dbd0eSchristosquickly interpret the data.
2217*c42dbd0eSchristos
2218*c42dbd0eSchristos We conclude this section with thread-level event counter overviews,
2219*c42dbd0eSchristosbut before we go into this, there is an important metric we need to
2220*c42dbd0eSchristosmention.
2221*c42dbd0eSchristos
2222*c42dbd0eSchristos In case it is known how many instructions and CPU cycles have been
2223*c42dbd0eSchristosexecuted, the value for the IPC ("Instructions Per Clockcycle") can be
2224*c42dbd0eSchristoscomputed. *Note Hardware Event Counters Explained::. This is a derived
2225*c42dbd0eSchristosmetric that gives an indication of how well the processor is utilized.
2226*c42dbd0eSchristosThe inverse of the IPC is called CPI.
2227*c42dbd0eSchristos
2228*c42dbd0eSchristos The 'gprofng display text' command automatically computes the IPC and
2229*c42dbd0eSchristosCPI values if an experiment contains the event counter values for the
2230*c42dbd0eSchristosinstructions and CPU cycles executed. These are part of the metric list
2231*c42dbd0eSchristosand can be displayed, just like any other metric.
2232*c42dbd0eSchristos
2233*c42dbd0eSchristos This can be verified through the 'metric_list' command. If we go
2234*c42dbd0eSchristosback to our earlier experiment with the default event counters, we get
2235*c42dbd0eSchristosthe following result.
2236*c42dbd0eSchristos
2237*c42dbd0eSchristos $ gprofng display text -metric_list mxv.hwc.def.2.thr.er
2238*c42dbd0eSchristos
2239*c42dbd0eSchristos Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
2240*c42dbd0eSchristos Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
2241*c42dbd0eSchristos Available metrics:
2242*c42dbd0eSchristos Exclusive Total CPU Time: e.%totalcpu
2243*c42dbd0eSchristos Inclusive Total CPU Time: i.%totalcpu
2244*c42dbd0eSchristos Exclusive CPU Cycles: e.+%cycles
2245*c42dbd0eSchristos Inclusive CPU Cycles: i.+%cycles
2246*c42dbd0eSchristos Exclusive Instructions Executed: e+%insts
2247*c42dbd0eSchristos Inclusive Instructions Executed: i+%insts
2248*c42dbd0eSchristos Exclusive Last-Level Cache Misses: e+%llm
2249*c42dbd0eSchristos Inclusive Last-Level Cache Misses: i+%llm
2250*c42dbd0eSchristos Exclusive Instructions Per Cycle: e+IPC
2251*c42dbd0eSchristos Inclusive Instructions Per Cycle: i+IPC
2252*c42dbd0eSchristos Exclusive Cycles Per Instruction: e+CPI
2253*c42dbd0eSchristos Inclusive Cycles Per Instruction: i+CPI
2254*c42dbd0eSchristos Size: size
2255*c42dbd0eSchristos PC Address: address
2256*c42dbd0eSchristos Name: name
2257*c42dbd0eSchristos
2258*c42dbd0eSchristos Among the other metrics, we see the new metrics for the IPC and CPI
2259*c42dbd0eSchristoslisted.
2260*c42dbd0eSchristos
2261*c42dbd0eSchristos In the script below, we use this information and add the IPC and CPI
2262*c42dbd0eSchristosto the metrics to be displayed. We also use the thread filter to
2263*c42dbd0eSchristosdisplay these values for the individual threads.
2264*c42dbd0eSchristos
2265*c42dbd0eSchristos This is the complete script we have used. Other than a different
2266*c42dbd0eSchristosselection of the metrics, there are no new features.
2267*c42dbd0eSchristos
2268*c42dbd0eSchristos # Define the metrics
2269*c42dbd0eSchristos metrics e.insts:e.%cycles:e.IPC:e.CPI
2270*c42dbd0eSchristos # Sort with respect to cycles
2271*c42dbd0eSchristos sort e.cycles
2272*c42dbd0eSchristos # Limit the output to 5 lines
2273*c42dbd0eSchristos limit 5
2274*c42dbd0eSchristos # Get the function overview for all threads
2275*c42dbd0eSchristos functions
2276*c42dbd0eSchristos # Get the function overview for thread 1
2277*c42dbd0eSchristos thread_select 1
2278*c42dbd0eSchristos functions
2279*c42dbd0eSchristos # Get the function overview for thread 2
2280*c42dbd0eSchristos thread_select 2
2281*c42dbd0eSchristos functions
2282*c42dbd0eSchristos # Get the function overview for thread 3
2283*c42dbd0eSchristos thread_select 3
2284*c42dbd0eSchristos functions
2285*c42dbd0eSchristos
2286*c42dbd0eSchristos In the metrics definition on the second line, we explicitly request
2287*c42dbd0eSchristosthe counter values for the instructions ('e.insts') and CPU cycles
2288*c42dbd0eSchristos('e.cycles') executed. These names can be found in the output from the
2289*c42dbd0eSchristos'metric_list' command above. In addition to these metrics, we also
2290*c42dbd0eSchristosrequest the IPC and CPI to be shown.
2291*c42dbd0eSchristos
2292*c42dbd0eSchristos As before, we used the 'limit' command to control the number of
2293*c42dbd0eSchristosfunctions displayed. We then request an overview for all the threads,
2294*c42dbd0eSchristosfollowed by three sets of two commands to select a thread and display
2295*c42dbd0eSchristosthe function overview.
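   As an aside, if only the aggregated overview across all threads is
needed, a script is not strictly necessary.  Assuming that, as with the
'-metrics' and '-limit' options used earlier, these metric names are
also accepted directly on the command line, a single command along the
following lines should produce a comparable table.  Note that it keeps
the default sort order, since the 'sort' command is not part of it:

     $ gprofng display text -metrics e.insts:e.%cycles:e.IPC:e.CPI \
         -limit 5 -functions mxv.hwc.def.2.thr.er

   The per-thread views, however, require the thread selection and are
easiest to produce with the script.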
2296*c42dbd0eSchristos 2297*c42dbd0eSchristos The script above is used as follows: 2298*c42dbd0eSchristos 2299*c42dbd0eSchristos $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er 2300*c42dbd0eSchristos 2301*c42dbd0eSchristos This script produces four tables. We list them separately below, and 2302*c42dbd0eSchristoshave left out the additional output. 2303*c42dbd0eSchristos 2304*c42dbd0eSchristos The first table shows the accumulated values across the three threads 2305*c42dbd0eSchristosthat have been active. 2306*c42dbd0eSchristos 2307*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2308*c42dbd0eSchristos 2309*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2310*c42dbd0eSchristos Instructions Cycles IPC CPI 2311*c42dbd0eSchristos Executed sec. % 2312*c42dbd0eSchristos 7906475309 2.691 100.00 1.473 0.679 <Total> 2313*c42dbd0eSchristos 7432724378 2.598 96.54 1.434 0.697 mxv_core 2314*c42dbd0eSchristos 188860269 0.035 1.31 2.682 0.373 erand48_r 2315*c42dbd0eSchristos 73623396 0.026 0.95 1.438 0.696 init_data 2316*c42dbd0eSchristos 76824434 0.018 0.66 2.182 0.458 drand48 2317*c42dbd0eSchristos 2318*c42dbd0eSchristos This shows that IPC of this program is completely dominated by 2319*c42dbd0eSchristosfunction 'mxv_core'. It has a fairly low IPC value of 1.43. 2320*c42dbd0eSchristos 2321*c42dbd0eSchristos The next table is for thread 1 and shows the values for the main 2322*c42dbd0eSchristosthread. 2323*c42dbd0eSchristos 2324*c42dbd0eSchristos Exp Sel Total 2325*c42dbd0eSchristos === === ===== 2326*c42dbd0eSchristos 1 1 3 2327*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2328*c42dbd0eSchristos 2329*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2330*c42dbd0eSchristos Instructions Cycles IPC CPI 2331*c42dbd0eSchristos Executed sec. % 2332*c42dbd0eSchristos 473750931 0.093 100.00 2.552 0.392 <Total> 2333*c42dbd0eSchristos 188860269 0.035 37.93 2.682 0.373 erand48_r 2334*c42dbd0eSchristos 73623396 0.026 27.59 1.438 0.696 init_data 2335*c42dbd0eSchristos 76824434 0.018 18.97 2.182 0.458 drand48 2336*c42dbd0eSchristos 134442832 0.013 13.79 5.250 0.190 __drand48_iterate 2337*c42dbd0eSchristos 2338*c42dbd0eSchristos Although this thread hardly uses any CPU cycles, the overall IPC of 2339*c42dbd0eSchristos2.55 is not all that bad. 2340*c42dbd0eSchristos 2341*c42dbd0eSchristos Last, we show the tables for threads 2 and 3: 2342*c42dbd0eSchristos 2343*c42dbd0eSchristos Exp Sel Total 2344*c42dbd0eSchristos === === ===== 2345*c42dbd0eSchristos 1 2 3 2346*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2347*c42dbd0eSchristos 2348*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2349*c42dbd0eSchristos Instructions Cycles IPC CPI 2350*c42dbd0eSchristos Executed sec. % 2351*c42dbd0eSchristos 3716362189 1.298 100.00 1.435 0.697 <Total> 2352*c42dbd0eSchristos 3716362189 1.298 100.00 1.435 0.697 mxv_core 2353*c42dbd0eSchristos 0 0. 0. 0. 0. collector_root 2354*c42dbd0eSchristos 0 0. 0. 0. 0. driver_mxv 2355*c42dbd0eSchristos 2356*c42dbd0eSchristos Exp Sel Total 2357*c42dbd0eSchristos === === ===== 2358*c42dbd0eSchristos 1 3 3 2359*c42dbd0eSchristos Functions sorted by metric: Exclusive CPU Cycles 2360*c42dbd0eSchristos 2361*c42dbd0eSchristos Excl. Excl. CPU Excl. Excl. Name 2362*c42dbd0eSchristos Instructions Cycles IPC CPI 2363*c42dbd0eSchristos Executed sec. % 2364*c42dbd0eSchristos 3716362189 1.300 100.00 1.433 0.698 <Total> 2365*c42dbd0eSchristos 3716362189 1.300 100.00 1.433 0.698 mxv_core 2366*c42dbd0eSchristos 0 0. 0. 0. 0. 
collector_root 2367*c42dbd0eSchristos 0 0. 0. 0. 0. driver_mxv 2368*c42dbd0eSchristos 2369*c42dbd0eSchristos It is seen that both execute the same number of instructions and take 2370*c42dbd0eSchristosabout the same number of CPU cycles. As a result, the IPC is the same 2371*c42dbd0eSchristosfor both threads. 2372*c42dbd0eSchristos 2373*c42dbd0eSchristos 2374*c42dbd0eSchristosFile: gprofng.info, Node: Java Profiling, Prev: Profile Hardware Event Counters, Up: A Mini Tutorial 2375*c42dbd0eSchristos 2376*c42dbd0eSchristos3.5 Java Profiling 2377*c42dbd0eSchristos================== 2378*c42dbd0eSchristos 2379*c42dbd0eSchristosThe 'gprofng collect app' command supports Java profiling. The '-j on' 2380*c42dbd0eSchristosoption can be used for this, but since this feature is enabled by 2381*c42dbd0eSchristosdefault, there is no need to set this explicitly. Java profiling may be 2382*c42dbd0eSchristosdisabled through the '-j off' option. 2383*c42dbd0eSchristos 2384*c42dbd0eSchristos The program is compiled as usual and the experiment directory is 2385*c42dbd0eSchristoscreated similar to what we have seen before. The only difference with a 2386*c42dbd0eSchristosC/C++ application is that the program has to be explicitly executed by 2387*c42dbd0eSchristosjava. 2388*c42dbd0eSchristos 2389*c42dbd0eSchristos For example, this is how to generate the experiment data for a Java 2390*c42dbd0eSchristosprogram that has the source code stored in file 'Pi.java': 2391*c42dbd0eSchristos 2392*c42dbd0eSchristos $ javac Pi.java 2393*c42dbd0eSchristos $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in 2394*c42dbd0eSchristos 2395*c42dbd0eSchristos Regarding which java is selected to generate the data, 'gprofng' 2396*c42dbd0eSchristosfirst looks for the JDK in the path set in either the 'JDK_HOME' 2397*c42dbd0eSchristosenvironment variable, or in the 'JAVA_PATH' environment variable. If 2398*c42dbd0eSchristosneither of these variables is set, it checks for a JDK in the search 2399*c42dbd0eSchristospath (set in the PATH environment variable). If there is no JDK in this 2400*c42dbd0eSchristospath, it checks for the java executable in '/usr/java/bin/java'. 2401*c42dbd0eSchristos 2402*c42dbd0eSchristos In case additional options need to be passed on to the JVM, the '-J 2403*c42dbd0eSchristos<string>' option can be used. The string with the option(s) has to be 2404*c42dbd0eSchristosdelimited by quotation marks in case there is more than one argument. 2405*c42dbd0eSchristos 2406*c42dbd0eSchristos The 'gprofng display text' command may be used to view the 2407*c42dbd0eSchristosperformance data. There is no need for any special options and the same 2408*c42dbd0eSchristoscommands as previously discussed are supported. 2409*c42dbd0eSchristos 2410*c42dbd0eSchristos The 'viewmode' command *Note The Viewmode:: is very useful to examine 2411*c42dbd0eSchristosthe call stacks. 2412*c42dbd0eSchristos 2413*c42dbd0eSchristos For example, this is how one can see the native call stacks. For 2414*c42dbd0eSchristoslay-out purposes we have restricted the list to the first five entries: 2415*c42dbd0eSchristos 2416*c42dbd0eSchristos $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er 2417*c42dbd0eSchristos 2418*c42dbd0eSchristos Print limit set to 5 2419*c42dbd0eSchristos Viewmode set to machine 2420*c42dbd0eSchristos Functions Call Tree. Metric: Attributed Total CPU Time 2421*c42dbd0eSchristos 2422*c42dbd0eSchristos Attr. Name 2423*c42dbd0eSchristos Total 2424*c42dbd0eSchristos CPU sec. 
2425*c42dbd0eSchristos 1.381 +-<Total> 2426*c42dbd0eSchristos 1.171 +-Pi.calculatePi(double) 2427*c42dbd0eSchristos 0.110 +-collector_root 2428*c42dbd0eSchristos 0.110 | +-JavaMain 2429*c42dbd0eSchristos 0.070 | +-jni_CallStaticVoidMethod 2430*c42dbd0eSchristos 2431*c42dbd0eSchristosNote that the selection of the viewmode is echoed in the output. 2432*c42dbd0eSchristos 2433*c42dbd0eSchristos 2434*c42dbd0eSchristosFile: gprofng.info, Node: Terminology, Next: Other Document Formats, Prev: A Mini Tutorial, Up: Top 2435*c42dbd0eSchristos 2436*c42dbd0eSchristos4 Terminology 2437*c42dbd0eSchristos************* 2438*c42dbd0eSchristos 2439*c42dbd0eSchristosThroughout this manual, certain terminology specific to profiling tools, 2440*c42dbd0eSchristosor 'gprofng', or even to this document only, is used. In this chapter 2441*c42dbd0eSchristoswe explain this terminology in detail. 2442*c42dbd0eSchristos 2443*c42dbd0eSchristos* Menu: 2444*c42dbd0eSchristos 2445*c42dbd0eSchristos* The Program Counter:: What is a Program Counter? 2446*c42dbd0eSchristos* Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics. 2447*c42dbd0eSchristos* Metric Definitions:: Definitions associated with metrics. 2448*c42dbd0eSchristos* The Viewmode:: Select the way call stacks are presented. 2449*c42dbd0eSchristos* The Selection List:: How to define a selection. 2450*c42dbd0eSchristos* Load Objects and Functions:: The components in an application. 2451*c42dbd0eSchristos* The Concept of a CPU in gprofng:: The definition of a CPU. 2452*c42dbd0eSchristos* Hardware Event Counters Explained:: What are event counters? 2453*c42dbd0eSchristos* apath:: Our generic definition of a path. 2454*c42dbd0eSchristos 2455*c42dbd0eSchristos 2456*c42dbd0eSchristosFile: gprofng.info, Node: The Program Counter, Next: Inclusive and Exclusive Metrics, Up: Terminology 2457*c42dbd0eSchristos 2458*c42dbd0eSchristos4.1 The Program Counter 2459*c42dbd0eSchristos======================= 2460*c42dbd0eSchristos 2461*c42dbd0eSchristosThe _Program Counter_, or PC for short, keeps track where program 2462*c42dbd0eSchristosexecution is. The address of the next instruction to be executed is 2463*c42dbd0eSchristosstored in a special purpose register in the processor, or core. 2464*c42dbd0eSchristos 2465*c42dbd0eSchristos The PC is sometimes also referred to as the _instruction pointer_, 2466*c42dbd0eSchristosbut we will use Program Counter or PC throughout this document. 2467*c42dbd0eSchristos 2468*c42dbd0eSchristos 2469*c42dbd0eSchristosFile: gprofng.info, Node: Inclusive and Exclusive Metrics, Next: Metric Definitions, Prev: The Program Counter, Up: Terminology 2470*c42dbd0eSchristos 2471*c42dbd0eSchristos4.2 Inclusive and Exclusive Metrics 2472*c42dbd0eSchristos=================================== 2473*c42dbd0eSchristos 2474*c42dbd0eSchristosIn the remainder, these two concepts occur quite often and for lack of a 2475*c42dbd0eSchristosbetter place, they are explained here. 2476*c42dbd0eSchristos 2477*c42dbd0eSchristos The _inclusive_ value for a metric includes all values that are part 2478*c42dbd0eSchristosof the dynamic extent of the target function. For example if function 2479*c42dbd0eSchristos'A' calls functions 'B' and 'C', the inclusive CPU time for 'A' includes 2480*c42dbd0eSchristosthe CPU time spent in 'B' and 'C'. 2481*c42dbd0eSchristos 2482*c42dbd0eSchristos In contrast with this, the _exclusive_ value for a metric is computed 2483*c42dbd0eSchristosby excluding the metric values used by other functions called. 
In our
2484*c42dbd0eSchristosimaginary example, the exclusive CPU time for function 'A' is the time
2485*c42dbd0eSchristosspent outside calling functions 'B' and 'C'.
2486*c42dbd0eSchristos
2487*c42dbd0eSchristos In case of a _leaf function_, the inclusive and exclusive values for
2488*c42dbd0eSchristosthe metric are the same since, by definition, it is not calling any other
2489*c42dbd0eSchristosfunction(s).
2490*c42dbd0eSchristos
2491*c42dbd0eSchristos Why do we use these two different values? The inclusive metric shows
2492*c42dbd0eSchristosthe most expensive path, in terms of this metric, in the application.
2493*c42dbd0eSchristosFor example, if the metric is cache misses, the function with the
2494*c42dbd0eSchristoshighest inclusive metric tells you where most of the cache misses come
2495*c42dbd0eSchristosfrom.
2496*c42dbd0eSchristos
2497*c42dbd0eSchristos Within this branch of the application, the exclusive metric points to
2498*c42dbd0eSchristosthe functions that contribute, and helps to identify which part(s) to
2499*c42dbd0eSchristosconsider for further analysis.
2500*c42dbd0eSchristos
2501*c42dbd0eSchristos
2502*c42dbd0eSchristosFile: gprofng.info, Node: Metric Definitions, Next: The Viewmode, Prev: Inclusive and Exclusive Metrics, Up: Terminology
2503*c42dbd0eSchristos
2504*c42dbd0eSchristos4.3 Metric Definitions
2505*c42dbd0eSchristos======================
2506*c42dbd0eSchristos
2507*c42dbd0eSchristosThe metrics to be shown are highly customizable. In this section we
2508*c42dbd0eSchristosexplain the definitions associated with metrics.
2509*c42dbd0eSchristos
2510*c42dbd0eSchristos The 'metrics' command takes a colon (:) separated list of special
2511*c42dbd0eSchristoskeywords. Each keyword consists of the following three fields:
2512*c42dbd0eSchristos'<flavor>''<visibility>''<metric_name>'.
2513*c42dbd0eSchristos
2514*c42dbd0eSchristos The _<flavor>_ field is either an 'e' for "exclusive", or 'i' for
2515*c42dbd0eSchristos"inclusive". The '<metric_name>' field is the name of the metric
2516*c42dbd0eSchristosrequest. The _<visibility>_ field consists of one or more characters
2517*c42dbd0eSchristosfrom the following table:
2518*c42dbd0eSchristos
2519*c42dbd0eSchristos'.'
2520*c42dbd0eSchristos Show the metric as time. This applies to timing metrics and
2521*c42dbd0eSchristos hardware event counters that measure cycles. Interpret as '+' for
2522*c42dbd0eSchristos other metrics.
2523*c42dbd0eSchristos
2524*c42dbd0eSchristos'%'
2525*c42dbd0eSchristos Show the metric as a percentage of the total value for this metric.
2526*c42dbd0eSchristos
2527*c42dbd0eSchristos'+'
2528*c42dbd0eSchristos Show the metric as an absolute value. For hardware event counters
2529*c42dbd0eSchristos this is the event count. Interpret as '.' for timing metrics.
2530*c42dbd0eSchristos
2531*c42dbd0eSchristos'|'
2532*c42dbd0eSchristos Do not show any metric value. Cannot be used with other visibility
2533*c42dbd0eSchristos characters.
 For example, the definition 'e.%cycles', which appeared in the metric overviews earlier in this manual, requests the exclusive CPU cycles, displayed both as time and as a percentage of the total.
2534*c42dbd0eSchristos
2535*c42dbd0eSchristos
2536*c42dbd0eSchristosFile: gprofng.info, Node: The Viewmode, Next: The Selection List, Prev: Metric Definitions, Up: Terminology
2537*c42dbd0eSchristos
2538*c42dbd0eSchristos4.4 The Viewmode
2539*c42dbd0eSchristos================
2540*c42dbd0eSchristos
2541*c42dbd0eSchristosThere are different ways to view a call stack in Java. In 'gprofng',
2542*c42dbd0eSchristosthis is called the _viewmode_ and the setting is controlled through a
2543*c42dbd0eSchristoscommand with the same name.
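   As a quick illustration, and assuming the Java experiment directory
'pi.demo.er' from the previous chapter is still available, the viewmode
can be selected directly on the command line, just as in the call tree
example shown there.  The 'expert' keyword used here is one of the
settings described below:

     $ gprofng display text -limit 5 -viewmode expert -calltree pi.demo.er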
2544*c42dbd0eSchristos 2545*c42dbd0eSchristos The 'viewmode' command takes one of the following keywords: 2546*c42dbd0eSchristos 2547*c42dbd0eSchristos'user' 2548*c42dbd0eSchristos This is the default and shows the Java call stacks for Java 2549*c42dbd0eSchristos threads. No call stacks for any housekeeping threads are shown. 2550*c42dbd0eSchristos The function list contains a function '<JVM-System>' that 2551*c42dbd0eSchristos represents the aggregated time from non-Java threads. When the JVM 2552*c42dbd0eSchristos software does not report a Java call stack, time is reported 2553*c42dbd0eSchristos against the function '<no Java callstack recorded>'. 2554*c42dbd0eSchristos 2555*c42dbd0eSchristos'expert' 2556*c42dbd0eSchristos Show the Java call stacks for Java threads when the Java code from 2557*c42dbd0eSchristos the user is executed and machine call stacks when JVM code is 2558*c42dbd0eSchristos executed, or when the JVM software does not report a Java call 2559*c42dbd0eSchristos stack. Show the machine call stacks for housekeeping threads. 2560*c42dbd0eSchristos 2561*c42dbd0eSchristos'machine' 2562*c42dbd0eSchristos Show the actual native call stacks for all threads. 2563*c42dbd0eSchristos 2564*c42dbd0eSchristos 2565*c42dbd0eSchristosFile: gprofng.info, Node: The Selection List, Next: Load Objects and Functions, Prev: The Viewmode, Up: Terminology 2566*c42dbd0eSchristos 2567*c42dbd0eSchristos4.5 The Selection List 2568*c42dbd0eSchristos====================== 2569*c42dbd0eSchristos 2570*c42dbd0eSchristosSeveral commands allow the user to specify a subset of a list. For 2571*c42dbd0eSchristosexample, to select specific threads from all the threads that have been 2572*c42dbd0eSchristosused when conducting the experiment(s). 2573*c42dbd0eSchristos 2574*c42dbd0eSchristos Such a selection list (or "list" in the remainder of this section) 2575*c42dbd0eSchristoscan be a single number, a contiguous range of numbers with the start and 2576*c42dbd0eSchristosend numbers separated by a hyphen ('-'), a comma-separated list of 2577*c42dbd0eSchristosnumbers and ranges, or the 'all' keyword. Lists must not contain 2578*c42dbd0eSchristosspaces. 2579*c42dbd0eSchristos 2580*c42dbd0eSchristos Each list can optionally be preceded by an experiment list with a 2581*c42dbd0eSchristossimilar format, separated from the list by a colon (:). If no 2582*c42dbd0eSchristosexperiment list is included, the list applies to all experiments. 2583*c42dbd0eSchristos 2584*c42dbd0eSchristos Multiple lists can be concatenated by separating the individual lists 2585*c42dbd0eSchristosby a plus sign. 2586*c42dbd0eSchristos 2587*c42dbd0eSchristos These are some examples of various filters using a list: 2588*c42dbd0eSchristos 2589*c42dbd0eSchristos'thread_select 1' 2590*c42dbd0eSchristos Select thread 1 from all experiments. 2591*c42dbd0eSchristos 2592*c42dbd0eSchristos'thread_select all:1' 2593*c42dbd0eSchristos Select thread 1 from all experiments. 2594*c42dbd0eSchristos 2595*c42dbd0eSchristos'thread_select 1:1+2:2' 2596*c42dbd0eSchristos Select thread 1 from experiment 1 and thread 2 from experiment 2. 2597*c42dbd0eSchristos 2598*c42dbd0eSchristos'cpu_select all:1,3,5' 2599*c42dbd0eSchristos Selects cores 1, 3, and 5 from all experiments. 2600*c42dbd0eSchristos 2601*c42dbd0eSchristos'cpu_select 1,2:all' 2602*c42dbd0eSchristos Select all cores from experiments 1 and 2, as listed by the 'by 2603*c42dbd0eSchristos exp_list' command. 
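   To illustrate how these elements can be combined, this is one more,
hypothetical, filter.  The thread and experiment numbers are made up and
merely follow the syntax described above:

'thread_select 1:1-3+2:5,7'
     Select threads 1, 2, and 3 from experiment 1, plus threads 5 and 7
     from experiment 2.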
2604*c42dbd0eSchristos 2605*c42dbd0eSchristos 2606*c42dbd0eSchristosFile: gprofng.info, Node: Load Objects and Functions, Next: The Concept of a CPU in gprofng, Prev: The Selection List, Up: Terminology 2607*c42dbd0eSchristos 2608*c42dbd0eSchristos4.6 Load Objects and Functions 2609*c42dbd0eSchristos============================== 2610*c42dbd0eSchristos 2611*c42dbd0eSchristosAn application consists of various components. The source code files 2612*c42dbd0eSchristosare compiled into object files. These are then glued together at link 2613*c42dbd0eSchristostime to form the executable. During execution, the program may also 2614*c42dbd0eSchristosdynamically load objects. 2615*c42dbd0eSchristos 2616*c42dbd0eSchristos A _load object_ is defined to be an executable, or shared object. A 2617*c42dbd0eSchristosshared library is an example of a load object in 'gprofng'. 2618*c42dbd0eSchristos 2619*c42dbd0eSchristos Each load object, contains a text section with the instructions 2620*c42dbd0eSchristosgenerated by the compiler, a data section for data, and various symbol 2621*c42dbd0eSchristostables. All load objects must contain an ELF symbol table, which gives 2622*c42dbd0eSchristosthe names and addresses of all the globally known functions in that 2623*c42dbd0eSchristosobject. 2624*c42dbd0eSchristos 2625*c42dbd0eSchristos Load objects compiled with the -g option contain additional symbolic 2626*c42dbd0eSchristosinformation that can augment the ELF symbol table and provide 2627*c42dbd0eSchristosinformation about functions that are not global, additional information 2628*c42dbd0eSchristosabout object modules from which the functions came, and line number 2629*c42dbd0eSchristosinformation relating addresses to source lines. 2630*c42dbd0eSchristos 2631*c42dbd0eSchristos The term _function_ is used to describe a set of instructions that 2632*c42dbd0eSchristosrepresent a high-level operation described in the source code. The term 2633*c42dbd0eSchristosalso covers methods as used in C++ and in the Java programming language. 2634*c42dbd0eSchristos 2635*c42dbd0eSchristos In the 'gprofng' context, functions are provided in source code 2636*c42dbd0eSchristosformat. Normally their names appear in the symbol table representing a 2637*c42dbd0eSchristosset of addresses. If the Program Counter (PC) is within that set, the 2638*c42dbd0eSchristosprogram is executing within that function. 2639*c42dbd0eSchristos 2640*c42dbd0eSchristos In principle, any address within the text segment of a load object 2641*c42dbd0eSchristoscan be mapped to a function. Exactly the same mapping is used for the 2642*c42dbd0eSchristosleaf PC and all the other PCs on the call stack. 2643*c42dbd0eSchristos 2644*c42dbd0eSchristos Most of the functions correspond directly to the source model of the 2645*c42dbd0eSchristosprogram, but there are exceptions. This topic is however outside of the 2646*c42dbd0eSchristosscope of this guide. 2647*c42dbd0eSchristos 2648*c42dbd0eSchristos 2649*c42dbd0eSchristosFile: gprofng.info, Node: The Concept of a CPU in gprofng, Next: Hardware Event Counters Explained, Prev: Load Objects and Functions, Up: Terminology 2650*c42dbd0eSchristos 2651*c42dbd0eSchristos4.7 The Concept of a CPU in gprofng 2652*c42dbd0eSchristos=================================== 2653*c42dbd0eSchristos 2654*c42dbd0eSchristosIn gprofng, there is the concept of a CPU. Admittedly, this is not the 2655*c42dbd0eSchristosbest word to describe what is meant here and may be replaced in the 2656*c42dbd0eSchristosfuture. 
2657*c42dbd0eSchristos
2658*c42dbd0eSchristos The word CPU is used in many of the displays. In the context of
2659*c42dbd0eSchristosgprofng, it is meant to denote a part of the processor that is capable
2660*c42dbd0eSchristosof executing instructions and has its own state, like the program
2661*c42dbd0eSchristoscounter.
2662*c42dbd0eSchristos
2663*c42dbd0eSchristos For example, on a contemporary processor, a CPU could be a core. In
2664*c42dbd0eSchristoscase hardware threads are supported within a core, it could be one of
2665*c42dbd0eSchristosthose hardware threads.
2666*c42dbd0eSchristos
2667*c42dbd0eSchristos
2668*c42dbd0eSchristosFile: gprofng.info, Node: Hardware Event Counters Explained, Next: apath, Prev: The Concept of a CPU in gprofng, Up: Terminology
2669*c42dbd0eSchristos
2670*c42dbd0eSchristos4.8 Hardware Event Counters Explained
2671*c42dbd0eSchristos=====================================
2672*c42dbd0eSchristos
2673*c42dbd0eSchristosFor quite a number of years now, many microprocessors have supported
2674*c42dbd0eSchristoshardware event counters.
2675*c42dbd0eSchristos
2676*c42dbd0eSchristos On the hardware side, this means that in the processor there are one
2677*c42dbd0eSchristosor more registers dedicated to count certain activities, or "events".
2678*c42dbd0eSchristosExamples of such events are the number of instructions executed, or the
2679*c42dbd0eSchristosnumber of cache misses at level 2 in the memory hierarchy.
2680*c42dbd0eSchristos
2681*c42dbd0eSchristos While there is a limited set of such registers, the user can map
2682*c42dbd0eSchristosevents onto them. In case more than one register is available, this
2683*c42dbd0eSchristosallows for the simultaneous measurement of various events.
2684*c42dbd0eSchristos
2685*c42dbd0eSchristos A simple, yet powerful, example is to simultaneously count the number
2686*c42dbd0eSchristosof CPU cycles and the number of instructions executed. These two numbers
2687*c42dbd0eSchristoscan then be used to compute the _IPC_ value. IPC stands for
2688*c42dbd0eSchristos"Instructions Per Clockcycle" and each processor has a maximum. For
2689*c42dbd0eSchristosexample, if this maximum number is 2, it means the processor is capable
2690*c42dbd0eSchristosof executing two instructions every clock cycle.
2691*c42dbd0eSchristos
2692*c42dbd0eSchristos Whether this is actually achieved depends on several factors,
2693*c42dbd0eSchristosincluding the instruction characteristics. However, in case the IPC
2694*c42dbd0eSchristosvalue is well below this maximum in a time critical part of the
2695*c42dbd0eSchristosapplication and this cannot be easily explained, further investigation
2696*c42dbd0eSchristosis probably warranted.
2697*c42dbd0eSchristos
2698*c42dbd0eSchristos A related metric is called _CPI_, or "Clockcycles Per Instruction".
2699*c42dbd0eSchristosIt is the inverse of the IPC and can be compared against the theoretical
2700*c42dbd0eSchristosvalue(s) of the target instruction(s). A significant difference may
2701*c42dbd0eSchristospoint at a bottleneck.
2702*c42dbd0eSchristos
2703*c42dbd0eSchristos One thing to keep in mind is that the value returned by a counter can
2704*c42dbd0eSchristoseither be the number of times the event occurred, or a CPU cycle count.
2705*c42dbd0eSchristosIn case of the latter it is possible to convert this number to time.
2706*c42dbd0eSchristos
2707*c42dbd0eSchristos This is often easier to interpret than a simple count, but there is
2708*c42dbd0eSchristosone caveat to keep in mind. The CPU frequency may not have been
2709*c42dbd0eSchristosconstant while the experiment was recorded and this impacts the time
2710*c42dbd0eSchristosreported.
2711*c42dbd0eSchristos
2712*c42dbd0eSchristos These event counters, or "counters" for short, provide great insight
2713*c42dbd0eSchristosinto what happens deep inside the processor. In case higher level
2714*c42dbd0eSchristosinformation does not provide the insight needed, the counters provide
2715*c42dbd0eSchristosthe information to get to the bottom of a performance problem.
2716*c42dbd0eSchristos
2717*c42dbd0eSchristos There are some things to consider though.
2718*c42dbd0eSchristos
2719*c42dbd0eSchristos * The event definitions and names vary across processors and it may
2720*c42dbd0eSchristos even happen that some events change with an update. Unfortunately,
2721*c42dbd0eSchristos and this is luckily rare, there are sometimes bugs causing the
2722*c42dbd0eSchristos wrong count to be returned.
2723*c42dbd0eSchristos
2724*c42dbd0eSchristos In 'gprofng', some of the processor specific event names have an
2725*c42dbd0eSchristos alias name. For example, 'insts' measures the instructions
2726*c42dbd0eSchristos executed. These aliases not only make it easier to identify the
2727*c42dbd0eSchristos functionality, but also provide portability of certain events
2728*c42dbd0eSchristos across processors.
2729*c42dbd0eSchristos
2730*c42dbd0eSchristos * Another complexity is that there are typically many events one can
2731*c42dbd0eSchristos monitor. There may be up to hundreds of events available and it could
2732*c42dbd0eSchristos require several experiments to zoom in on the root cause of a
2733*c42dbd0eSchristos performance problem.
2734*c42dbd0eSchristos
2735*c42dbd0eSchristos * There may be restrictions regarding the mapping of event(s) onto
2736*c42dbd0eSchristos the counters. For example, certain events may be restricted to
2737*c42dbd0eSchristos specific counters only. As a result, one may have to conduct
2738*c42dbd0eSchristos additional experiments to cover all the events of interest.
2739*c42dbd0eSchristos
2740*c42dbd0eSchristos * The names of the events may also not be easy to interpret. In such
2741*c42dbd0eSchristos cases, the description can be found in the architecture manual for
2742*c42dbd0eSchristos the processor.
2743*c42dbd0eSchristos
2744*c42dbd0eSchristos Despite these drawbacks, hardware event counters are extremely useful
2745*c42dbd0eSchristosand may even turn out to be indispensable.
2746*c42dbd0eSchristos
2747*c42dbd0eSchristos
2748*c42dbd0eSchristosFile: gprofng.info, Node: apath, Prev: Hardware Event Counters Explained, Up: Terminology
2749*c42dbd0eSchristos
2750*c42dbd0eSchristos4.9 What is <apath>?
2751*c42dbd0eSchristos====================
2752*c42dbd0eSchristos
2753*c42dbd0eSchristosIn most cases, 'gprofng' shows the absolute pathnames of directories.
2754*c42dbd0eSchristosThese tend to be rather long, causing display issues in this document.
2755*c42dbd0eSchristos
2756*c42dbd0eSchristos Instead of wrapping these long pathnames over multiple lines, we
2757*c42dbd0eSchristosdecided to represent them by the '<apath>' symbol, which stands for "an
2758*c42dbd0eSchristosabsolute pathname".
2759*c42dbd0eSchristos
2760*c42dbd0eSchristos Note that different occurrences of '<apath>' may represent different
2761*c42dbd0eSchristosabsolute pathnames.
2762*c42dbd0eSchristos
2763*c42dbd0eSchristos
2764*c42dbd0eSchristosFile: gprofng.info, Node: Other Document Formats, Next: Index, Prev: Terminology, Up: Top
2765*c42dbd0eSchristos
2766*c42dbd0eSchristos5 Other Document Formats
2767*c42dbd0eSchristos************************
2768*c42dbd0eSchristos
2769*c42dbd0eSchristosThis document is written in Texinfo and the source text is made
2770*c42dbd0eSchristosavailable as part of the binutils distribution. The file is called
2771*c42dbd0eSchristos'gprofng.texi' and can be found in subdirectory 'doc' under directory
2772*c42dbd0eSchristos'gprofng' in the top level directory.
2773*c42dbd0eSchristos
2774*c42dbd0eSchristos This file can be used to generate the document in the 'info', 'html',
2775*c42dbd0eSchristosand 'pdf' formats. The default installation procedure creates a file in
2776*c42dbd0eSchristosthe 'info' format and stores it in the documentation section of
2777*c42dbd0eSchristosbinutils.
2778*c42dbd0eSchristos
2779*c42dbd0eSchristos Probably the easiest way to generate a different format from this
2780*c42dbd0eSchristosTexinfo document is to go to the distribution directory that was created
2781*c42dbd0eSchristoswhen the tools were built. This is either the default distribution
2782*c42dbd0eSchristosdirectory, or the one that has been set with the '--prefix' option as
2783*c42dbd0eSchristospart of the 'configure' command. In this example we symbolize this
2784*c42dbd0eSchristoslocation with '<dist>'.
2785*c42dbd0eSchristos
2786*c42dbd0eSchristos The make file called 'Makefile' in directory '<dist>/gprofng/doc'
2787*c42dbd0eSchristossupports several commands to generate this document in different
2788*c42dbd0eSchristosformats. We recommend using these commands.
2789*c42dbd0eSchristos
2790*c42dbd0eSchristos They create the file(s) and install them in the documentation directory
2791*c42dbd0eSchristosof binutils, which is '<dist>/share/doc' in case 'html' or 'pdf' is
2792*c42dbd0eSchristosselected and '<dist>/share/info' for the file in the 'info' format.
2793*c42dbd0eSchristos
2794*c42dbd0eSchristos To generate this document in the requested format and install it in
2795*c42dbd0eSchristosthe documentation directory, the commands below should be executed. In
2796*c42dbd0eSchristosthis notation, '<format>' is one of 'info', 'html', or 'pdf':
2797*c42dbd0eSchristos
2798*c42dbd0eSchristos $ cd <dist>/gprofng/doc
2799*c42dbd0eSchristos $ make install-<format>
2800*c42dbd0eSchristos
2801*c42dbd0eSchristosSome things to note:
2802*c42dbd0eSchristos
2803*c42dbd0eSchristos * For the 'pdf' file to be generated, the TeX document formatting
2804*c42dbd0eSchristos software is required and the relevant commands need to be included
2805*c42dbd0eSchristos in the search path. An example of a popular TeX implementation is
2806*c42dbd0eSchristos _TexLive_. It is beyond the scope of this document to go into the
2807*c42dbd0eSchristos details of installing and using TeX, but it is well documented
2808*c42dbd0eSchristos elsewhere.
2809*c42dbd0eSchristos
2810*c42dbd0eSchristos * Instead of generating a single file in the 'html' format, it is
2811*c42dbd0eSchristos also possible to create a directory with individual files for the
2812*c42dbd0eSchristos various chapters. To do so, remove the use of '--no-split' in
2813*c42dbd0eSchristos variable 'MAKEINFOHTML' in the make file in the 'doc' directory.
2814*c42dbd0eSchristos 2815*c42dbd0eSchristos * The make file also supports commands to only generate the file in 2816*c42dbd0eSchristos the desired format and not move them to the documentation 2817*c42dbd0eSchristos directory. This is accomplished through the 'make <format>' 2818*c42dbd0eSchristos command. 2819*c42dbd0eSchristos 2820*c42dbd0eSchristos 2821*c42dbd0eSchristosFile: gprofng.info, Node: Index, Prev: Other Document Formats, Up: Top 2822*c42dbd0eSchristos 2823*c42dbd0eSchristosIndex 2824*c42dbd0eSchristos***** 2825*c42dbd0eSchristos 2826*c42dbd0eSchristos[index] 2827*c42dbd0eSchristos* Menu: 2828*c42dbd0eSchristos 2829*c42dbd0eSchristos* Command line mode: A First Profile. (line 39) 2830*c42dbd0eSchristos* Commands, calltree: The Call Tree. (line 10) 2831*c42dbd0eSchristos* Commands, compare delta: Comparison of Experiments. 2832*c42dbd0eSchristos (line 52) 2833*c42dbd0eSchristos* Commands, compare on/off: Comparison of Experiments. 2834*c42dbd0eSchristos (line 7) 2835*c42dbd0eSchristos* Commands, compare on/off <1>: Comparison of Experiments. 2836*c42dbd0eSchristos (line 51) 2837*c42dbd0eSchristos* Commands, compare ratio: Comparison of Experiments. 2838*c42dbd0eSchristos (line 52) 2839*c42dbd0eSchristos* Commands, compare ratio <1>: Examples Using Hardware Event Counters. 2840*c42dbd0eSchristos (line 166) 2841*c42dbd0eSchristos* Commands, cpus: Commands Specific to Multithreading. 2842*c42dbd0eSchristos (line 177) 2843*c42dbd0eSchristos* Commands, cpu_list: Commands Specific to Multithreading. 2844*c42dbd0eSchristos (line 176) 2845*c42dbd0eSchristos* Commands, disasm: The Disassembly View. 2846*c42dbd0eSchristos (line 8) 2847*c42dbd0eSchristos* Commands, experiment_list: Aggregation of Experiments. 2848*c42dbd0eSchristos (line 21) 2849*c42dbd0eSchristos* Commands, fsingle: Information on Load Objects. 2850*c42dbd0eSchristos (line 10) 2851*c42dbd0eSchristos* Commands, fsingle <1>: Information on Load Objects. 2852*c42dbd0eSchristos (line 36) 2853*c42dbd0eSchristos* Commands, fsummary: Information on Load Objects. 2854*c42dbd0eSchristos (line 10) 2855*c42dbd0eSchristos* Commands, fsummary <1>: Information on Load Objects. 2856*c42dbd0eSchristos (line 62) 2857*c42dbd0eSchristos* Commands, functions: A First Profile. (line 44) 2858*c42dbd0eSchristos* Commands, header: More Information on the Experiment. 2859*c42dbd0eSchristos (line 10) 2860*c42dbd0eSchristos* Commands, header <1>: Control the Sampling Frequency. 2861*c42dbd0eSchristos (line 59) 2862*c42dbd0eSchristos* Commands, limit: Control the Number of Lines in the Output. 2863*c42dbd0eSchristos (line 6) 2864*c42dbd0eSchristos* Commands, lines: The Source Code View. 2865*c42dbd0eSchristos (line 68) 2866*c42dbd0eSchristos* Commands, metrics: Display and Define the Metrics. 2867*c42dbd0eSchristos (line 11) 2868*c42dbd0eSchristos* Commands, metrics <1>: Display and Define the Metrics. 2869*c42dbd0eSchristos (line 38) 2870*c42dbd0eSchristos* Commands, metrics <2>: Metric Definitions. (line 9) 2871*c42dbd0eSchristos* Commands, metric_list: Display and Define the Metrics. 2872*c42dbd0eSchristos (line 10) 2873*c42dbd0eSchristos* Commands, metric_list <1>: Display and Define the Metrics. 2874*c42dbd0eSchristos (line 19) 2875*c42dbd0eSchristos* Commands, metric_list <2>: Examples Using Hardware Event Counters. 2876*c42dbd0eSchristos (line 261) 2877*c42dbd0eSchristos* Commands, objects: Information on Load Objects. 2878*c42dbd0eSchristos (line 10) 2879*c42dbd0eSchristos* Commands, overview: More Information on the Experiment. 
2880*c42dbd0eSchristos (line 54) 2881*c42dbd0eSchristos* Commands, pcs: The Disassembly View. 2882*c42dbd0eSchristos (line 73) 2883*c42dbd0eSchristos* Commands, script: Scripting. (line 11) 2884*c42dbd0eSchristos* Commands, sort: Sorting the Performance Data. 2885*c42dbd0eSchristos (line 6) 2886*c42dbd0eSchristos* Commands, source: The Source Code View. 2887*c42dbd0eSchristos (line 12) 2888*c42dbd0eSchristos* Commands, threads: Commands Specific to Multithreading. 2889*c42dbd0eSchristos (line 35) 2890*c42dbd0eSchristos* Commands, thread_list: Commands Specific to Multithreading. 2891*c42dbd0eSchristos (line 10) 2892*c42dbd0eSchristos* Commands, thread_select: Commands Specific to Multithreading. 2893*c42dbd0eSchristos (line 64) 2894*c42dbd0eSchristos* Commands, thread_select <1>: Commands Specific to Multithreading. 2895*c42dbd0eSchristos (line 88) 2896*c42dbd0eSchristos* Commands, viewmode: Java Profiling. (line 37) 2897*c42dbd0eSchristos* Commands, viewmode <1>: The Viewmode. (line 6) 2898*c42dbd0eSchristos* Compare experiments: Comparison of Experiments. 2899*c42dbd0eSchristos (line 10) 2900*c42dbd0eSchristos* CPI: Hardware Event Counters Explained. 2901*c42dbd0eSchristos (line 31) 2902*c42dbd0eSchristos* CPU: The Concept of a CPU in gprofng. 2903*c42dbd0eSchristos (line 6) 2904*c42dbd0eSchristos* Default metrics: Display and Define the Metrics. 2905*c42dbd0eSchristos (line 34) 2906*c42dbd0eSchristos* ELF: Load Objects and Functions. 2907*c42dbd0eSchristos (line 16) 2908*c42dbd0eSchristos* Exclusive metric: Inclusive and Exclusive Metrics. 2909*c42dbd0eSchristos (line 14) 2910*c42dbd0eSchristos* Experiment directory: Steps Needed to Create a Profile. 2911*c42dbd0eSchristos (line 21) 2912*c42dbd0eSchristos* Filters, Thread selection: Commands Specific to Multithreading. 2913*c42dbd0eSchristos (line 64) 2914*c42dbd0eSchristos* Flavor field: Metric Definitions. (line 13) 2915*c42dbd0eSchristos* Function: Load Objects and Functions. 2916*c42dbd0eSchristos (line 26) 2917*c42dbd0eSchristos* gprofng display html: Steps Needed to Create a Profile. 2918*c42dbd0eSchristos (line 46) 2919*c42dbd0eSchristos* gprofng display text: Steps Needed to Create a Profile. 2920*c42dbd0eSchristos (line 42) 2921*c42dbd0eSchristos* Hardware event counters, alias name: Hardware Event Counters Explained. 2922*c42dbd0eSchristos (line 57) 2923*c42dbd0eSchristos* Hardware event counters, auto option: Examples Using Hardware Event Counters. 2924*c42dbd0eSchristos (line 21) 2925*c42dbd0eSchristos* Hardware event counters, counter definition: Getting Information on the Counters Supported. 2926*c42dbd0eSchristos (line 85) 2927*c42dbd0eSchristos* Hardware event counters, description: Hardware Event Counters Explained. 2928*c42dbd0eSchristos (line 6) 2929*c42dbd0eSchristos* Hardware event counters, hwc metric: Examples Using Hardware Event Counters. 2930*c42dbd0eSchristos (line 31) 2931*c42dbd0eSchristos* Hardware event counters, IPC: Examples Using Hardware Event Counters. 2932*c42dbd0eSchristos (line 250) 2933*c42dbd0eSchristos* Hardware event counters, variable CPU frequency: Hardware Event Counters Explained. 2934*c42dbd0eSchristos (line 40) 2935*c42dbd0eSchristos* Inclusive metric: Inclusive and Exclusive Metrics. 2936*c42dbd0eSchristos (line 9) 2937*c42dbd0eSchristos* Instruction level timings: The Disassembly View. 2938*c42dbd0eSchristos (line 9) 2939*c42dbd0eSchristos* Instruction pointer: The Program Counter. (line 10) 2940*c42dbd0eSchristos* Interpreter mode: A First Profile. 
(line 30) 2941*c42dbd0eSchristos* IPC: Hardware Event Counters Explained. 2942*c42dbd0eSchristos (line 20) 2943*c42dbd0eSchristos* Java profiling, -J <string>: Java Profiling. (line 29) 2944*c42dbd0eSchristos* Java profiling, -j on/off: Java Profiling. (line 6) 2945*c42dbd0eSchristos* Java profiling, <JVM-System>: The Viewmode. (line 15) 2946*c42dbd0eSchristos* Java profiling, <no Java callstack recorded>: The Viewmode. (line 18) 2947*c42dbd0eSchristos* Java profiling, different view modes: Java Profiling. (line 37) 2948*c42dbd0eSchristos* Java profiling, JAVA_PATH: Java Profiling. (line 24) 2949*c42dbd0eSchristos* Java profiling, JDK_HOME: Java Profiling. (line 23) 2950*c42dbd0eSchristos* Leaf function: Inclusive and Exclusive Metrics. 2951*c42dbd0eSchristos (line 19) 2952*c42dbd0eSchristos* List specification: The Selection List. (line 6) 2953*c42dbd0eSchristos* Load object: Load Objects and Functions. 2954*c42dbd0eSchristos (line 11) 2955*c42dbd0eSchristos* Load objects: Information on Load Objects. 2956*c42dbd0eSchristos (line 11) 2957*c42dbd0eSchristos* Metric name field: Metric Definitions. (line 13) 2958*c42dbd0eSchristos* Miscellaneous , ##: The Source Code View. 2959*c42dbd0eSchristos (line 63) 2960*c42dbd0eSchristos* Miscellaneous, <apath>: Information on Load Objects. 2961*c42dbd0eSchristos (line 16) 2962*c42dbd0eSchristos* Miscellaneous, <Total>: A First Profile. (line 86) 2963*c42dbd0eSchristos* mxv-pthreads.exe: The Example Program. (line 12) 2964*c42dbd0eSchristos* Options, -C: More Information on the Experiment. 2965*c42dbd0eSchristos (line 48) 2966*c42dbd0eSchristos* Options, -h: Getting Information on the Counters Supported. 2967*c42dbd0eSchristos (line 9) 2968*c42dbd0eSchristos* Options, -h <1>: Examples Using Hardware Event Counters. 2969*c42dbd0eSchristos (line 21) 2970*c42dbd0eSchristos* Options, -o: Name the Experiment Directory. 2971*c42dbd0eSchristos (line 15) 2972*c42dbd0eSchristos* Options, -O: Name the Experiment Directory. 2973*c42dbd0eSchristos (line 19) 2974*c42dbd0eSchristos* Options, -O <1>: A More Elaborate Example. 2975*c42dbd0eSchristos (line 12) 2976*c42dbd0eSchristos* Options, -p: The Call Tree. (line 66) 2977*c42dbd0eSchristos* Options, -p <1>: Control the Sampling Frequency. 2978*c42dbd0eSchristos (line 18) 2979*c42dbd0eSchristos* PC: The Program Counter. (line 6) 2980*c42dbd0eSchristos* PC <1>: Load Objects and Functions. 2981*c42dbd0eSchristos (line 32) 2982*c42dbd0eSchristos* PC sampling: Sampling versus Tracing. 2983*c42dbd0eSchristos (line 7) 2984*c42dbd0eSchristos* Posix Threads: The Example Program. (line 8) 2985*c42dbd0eSchristos* Program Counter: The Program Counter. (line 6) 2986*c42dbd0eSchristos* Program Counter <1>: Load Objects and Functions. 2987*c42dbd0eSchristos (line 32) 2988*c42dbd0eSchristos* Program Counter sampling: Sampling versus Tracing. 2989*c42dbd0eSchristos (line 7) 2990*c42dbd0eSchristos* Pthreads: The Example Program. (line 8) 2991*c42dbd0eSchristos* Sampling interval: Control the Sampling Frequency. 2992*c42dbd0eSchristos (line 20) 2993*c42dbd0eSchristos* Selection list: The Selection List. (line 6) 2994*c42dbd0eSchristos* Source level timings: The Source Code View. 2995*c42dbd0eSchristos (line 10) 2996*c42dbd0eSchristos* TeX: Other Document Formats. 2997*c42dbd0eSchristos (line 40) 2998*c42dbd0eSchristos* Thread affinity: Commands Specific to Multithreading. 2999*c42dbd0eSchristos (line 162) 3000*c42dbd0eSchristos* Total CPU time: A First Profile. (line 74) 3001*c42dbd0eSchristos* Viewmode: The Viewmode. 
(line 6) 3002*c42dbd0eSchristos* Visibility field: Sorting the Performance Data. 3003*c42dbd0eSchristos (line 9) 3004*c42dbd0eSchristos* Visibility field <1>: Metric Definitions. (line 13) 3005*c42dbd0eSchristos 3006*c42dbd0eSchristos 3007*c42dbd0eSchristos 3008*c42dbd0eSchristosTag Table: 3009*c42dbd0eSchristosNode: Top750 3010*c42dbd0eSchristosNode: Introduction3075 3011*c42dbd0eSchristosNode: Overview4352 3012*c42dbd0eSchristosNode: Main Features4993 3013*c42dbd0eSchristosNode: Sampling versus Tracing6733 3014*c42dbd0eSchristosNode: Steps Needed to Create a Profile9124 3015*c42dbd0eSchristosNode: A Mini Tutorial11311 3016*c42dbd0eSchristosNode: Getting Started12030 3017*c42dbd0eSchristosNode: The Example Program13680 3018*c42dbd0eSchristosNode: A First Profile14800 3019*c42dbd0eSchristosNode: The Source Code View19305 3020*c42dbd0eSchristosNode: The Disassembly View23801 3021*c42dbd0eSchristosNode: Display and Define the Metrics28490 3022*c42dbd0eSchristosNode: A First Customization of the Output30326 3023*c42dbd0eSchristosNode: Name the Experiment Directory32175 3024*c42dbd0eSchristosNode: Control the Number of Lines in the Output33272 3025*c42dbd0eSchristosNode: Sorting the Performance Data34031 3026*c42dbd0eSchristosNode: Scripting34973 3027*c42dbd0eSchristosNode: A More Elaborate Example35892 3028*c42dbd0eSchristosNode: The Call Tree38682 3029*c42dbd0eSchristosNode: More Information on the Experiment41423 3030*c42dbd0eSchristosNode: Control the Sampling Frequency44722 3031*c42dbd0eSchristosNode: Information on Load Objects47143 3032*c42dbd0eSchristosNode: Support for Multithreading50782 3033*c42dbd0eSchristosNode: Creating a Multithreading Experiment51422 3034*c42dbd0eSchristosNode: Commands Specific to Multithreading52935 3035*c42dbd0eSchristosNode: Viewing Multiple Experiments62341 3036*c42dbd0eSchristosNode: Aggregation of Experiments62918 3037*c42dbd0eSchristosNode: Comparison of Experiments65232 3038*c42dbd0eSchristosNode: Profile Hardware Event Counters69848 3039*c42dbd0eSchristosNode: Getting Information on the Counters Supported70556 3040*c42dbd0eSchristosNode: Examples Using Hardware Event Counters78156 3041*c42dbd0eSchristosNode: Java Profiling95853 3042*c42dbd0eSchristosNode: Terminology98222 3043*c42dbd0eSchristosNode: The Program Counter99234 3044*c42dbd0eSchristosNode: Inclusive and Exclusive Metrics99726 3045*c42dbd0eSchristosNode: Metric Definitions101179 3046*c42dbd0eSchristosNode: The Viewmode102364 3047*c42dbd0eSchristosNode: The Selection List103503 3048*c42dbd0eSchristosNode: Load Objects and Functions104905 3049*c42dbd0eSchristosNode: The Concept of a CPU in gprofng106919 3050*c42dbd0eSchristosNode: Hardware Event Counters Explained107680 3051*c42dbd0eSchristosNode: apath111472 3052*c42dbd0eSchristosNode: Other Document Formats112007 3053*c42dbd0eSchristosNode: Index114531 3054*c42dbd0eSchristos 3055*c42dbd0eSchristosEnd Tag Table 3056*c42dbd0eSchristos 3057*c42dbd0eSchristos 3058*c42dbd0eSchristosLocal Variables: 3059*c42dbd0eSchristoscoding: utf-8 3060*c42dbd0eSchristosEnd: 3061