This is gprofng.info, produced by makeinfo version 6.8 from
gprofng.texi.

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* gprofng: (gprofng).           The next generation profiling tool for Linux
END-INFO-DIR-ENTRY


File: gprofng.info, Node: Top, Next: Introduction, Up: (dir)

GNU Gprofng
***********

This document is the manual for gprofng, last updated 22 February 2022.

   Copyright (C) 2022 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover texts, and with no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License."

* Menu:

* Introduction::             About this manual.
* Overview::                 A brief overview of gprofng.
* A Mini Tutorial::          A short tutorial covering the key features.
* Terminology::              Various concepts and some terminology explained.
* Other Document Formats::   How to create this document in other formats.
* Index::                    The index.

 -- The Detailed Node Listing --

Introduction

Overview

* Main Features::                      A high level overview.
* Sampling versus Tracing::            The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::   How to create a profile.
A Mini Tutorial

* Getting Started::                  The basics of profiling with 'gprofng'.
* Support for Multithreading::       Commands specific to multithreaded applications.
* Viewing Multiple Experiments::     Analyze multiple experiments.
* Profile Hardware Event Counters::  How to use hardware event counters.
* Java Profiling::                   How to profile a Java application.

Terminology

* The Program Counter::                What is a Program Counter?
* Inclusive and Exclusive Metrics::    An explanation of inclusive and exclusive metrics.
* Metric Definitions::                 Definitions associated with metrics.
* The Viewmode::                       Select the way call stacks are presented.
* The Selection List::                 How to define a selection.
* Load Objects and Functions::         The components in an application.
* The Concept of a CPU in gprofng::    The definition of a CPU.
* Hardware Event Counters Explained::  What are event counters?
* apath::                              Our generic definition of a path.


File: gprofng.info, Node: Introduction, Next: Overview, Prev: Top, Up: Top

1 Introduction
**************

The gprofng tool is the next generation profiler for Linux.  It consists
of various commands to generate and display profile information.

   This manual starts with a tutorial on how to create and interpret a
profile.  This part is highly practical; its goal is to get you up to
speed quickly and show you how to get your first profile on your screen
as soon as possible.

   This is followed by more examples, covering many of the features.  At
the end of this tutorial, you should feel confident enough to tackle the
more complex tasks.

   In a future update a more formal reference manual will be included as
well.  Since even in this tutorial we use certain terminology, we have
included a chapter with descriptions at the end.  In case you encounter
unfamiliar terms, please check this chapter.

   One word of caution.
In several cases we had to tweak the screen output in order to make it
fit.  This is why the output may look somewhat different when you try
things yourself.

   For now, we wish you a smooth profiling experience with gprofng and
good luck tackling performance bottlenecks.


File: gprofng.info, Node: Overview, Next: A Mini Tutorial, Prev: Introduction, Up: Top

2 A Brief Overview of gprofng
*****************************

* Menu:

* Main Features::                      A high level overview.
* Sampling versus Tracing::            The pros and cons of sampling versus tracing.
* Steps Needed to Create a Profile::   How to create a profile.

Before we cover this tool in quite some detail, we start with a brief
overview of what it is and what its main features are.  Since we know
that many of you would like to get started right away, we explain the
basics of profiling with 'gprofng' already in this first chapter.


File: gprofng.info, Node: Main Features, Next: Sampling versus Tracing, Up: Overview

2.1 Main Features
=================

These are the main features of the gprofng tool:

   * Profiling is supported for applications written in C, C++, Java,
     or Scala.

   * Shared libraries are supported.  The information is presented at
     the instruction level.

   * The following multithreading programming models are supported:
     Pthreads, OpenMP, and Java threads.

   * This tool works with unmodified production level executables.
     There is no need to recompile the code, but if the '-g' option has
     been used when building the application, source line level
     information is available.

   * The focus is on support for code generated with the 'gcc' compiler,
     but there is some limited support for the 'icc' compiler as well.
     Future improvements and enhancements will focus on 'gcc' though.
   * Processors from Intel, AMD, and Arm are supported, but the level of
     support depends on the architectural details.  In particular,
     hardware event counters may not be supported.

   * Several views into the data are supported.  For example, there is
     a function overview showing where the time is spent, but source
     line, disassembly, call tree, and caller-callees overviews are
     available as well.

   * Through filters, the user can zoom in on an area of interest.

   * Two or more profiles can be aggregated, or used in a comparison.
     This comparison can be obtained at the function, source line, and
     disassembly level.

   * Through a scripting language, and customization of the metrics
     shown, the generation and creation of a profile can be fully
     automated and provide tailored output.


File: gprofng.info, Node: Sampling versus Tracing, Next: Steps Needed to Create a Profile, Prev: Main Features, Up: Overview

2.2 Sampling versus Tracing
===========================

A key difference with some other profiling tools is that the main data
collection command 'gprofng collect app' mostly uses Program Counter
(PC) sampling under the hood.

   With _sampling_, the executable is stopped at regular intervals.
Each time it is halted, key information is gathered and stored.  This
includes the Program Counter that keeps track of where the execution
is.  Hence the name.

   Together with operational data, this information is stored in the
experiment directory and can be viewed in the second phase.

   For example, the PC information is used to derive where the program
was when it was halted.  Since the sampling interval is known, it is
relatively easy to derive how much time was spent in the various parts
of the program.

   The opposite technique is generally referred to as _tracing_.
With tracing, the target is instrumented with specific calls that
collect the requested information.

   These are some of the pros and cons of PC sampling versus tracing:

   * Since there is no need to recompile, existing executables can be
     used and the profile measures the behaviour of exactly the same
     executable that is used in production runs.

     With tracing, one inherently profiles a different executable,
     because the calls to the instrumentation library may affect the
     compiler optimizations and run time behaviour.

   * With sampling, there are very few restrictions on what can be
     profiled and even without access to the source code, a basic
     profile can be made.

   * A downside of sampling is that, depending on the sampling
     frequency, small functions may be missed or not captured
     accurately.  Although this is rare, it may happen and this is the
     reason why the user has control over the sampling rate.

   * While tracing produces precise information, sampling is statistical
     in nature.  As a result, small variations may occur across
     seemingly identical runs.  We have not observed deviations of more
     than a few percent though, especially if the target job executed
     for a sufficiently long time.

   * With sampling, it is not possible to get an accurate count of how
     often functions are called.


File: gprofng.info, Node: Steps Needed to Create a Profile, Prev: Sampling versus Tracing, Up: Overview

2.3 Steps Needed to Create a Profile
====================================

Creating a profile takes two steps.  First the profile data needs to be
generated.  This is followed by a viewing step to create a report from
the information that has been gathered.

   Every gprofng command starts with 'gprofng', the name of the driver.
This is followed by a keyword to define the high level functionality.
Depending on this keyword, a third qualifier may be needed to further
narrow down the request.  This combination is then followed by options
that are specific to the functionality desired.

   The command to gather, or "collect", the performance data is called
'gprofng collect app'.  Aside from numerous options, this command takes
the name of the target executable as an input parameter.

   Upon completion of the run, the performance data can be found in the
newly created experiment directory.

   Unless explicitly specified otherwise, a default name for this
directory is chosen.  The name is 'test.<n>.er', where 'n' is the first
integer number not yet in use for such a name.

   For example, the first time 'gprofng collect app' is invoked, an
experiment directory with the name 'test.1.er' is created.

   Upon a subsequent invocation of 'gprofng collect app' in the same
directory, an experiment directory with the name 'test.2.er' will be
created, and so forth.

   Note that 'gprofng collect app' supports an option to explicitly name
the experiment directory.  Outside of the restriction that the name of
this directory has to end with '.er', any valid directory name can be
used for this.

   Now that we have the performance data, the next step is to display
it.

   The most commonly used command to view the performance information is
'gprofng display text'.  This is a very extensive and customizable tool
that produces the information in ASCII format.

   Another option is to use 'gprofng display html'.  This tool generates
a directory with files in HTML format.  These can be viewed in a
browser, allowing for easy navigation through the profile data.


File: gprofng.info, Node: A Mini Tutorial, Next: Terminology, Prev: Overview, Up: Top

3 A Mini Tutorial
*****************

In this chapter we present and discuss the main functionality of
'gprofng'.
We take a practical approach, using an example code to generate profile
data, and show how to get various performance reports.

* Menu:

* Getting Started::                  The basics of profiling with 'gprofng'.
* Support for Multithreading::       Commands specific to multithreaded applications.
* Viewing Multiple Experiments::     Analyze multiple experiments.
* Profile Hardware Event Counters::  How to use hardware event counters.
* Java Profiling::                   How to profile a Java application.


File: gprofng.info, Node: Getting Started, Next: Support for Multithreading, Up: A Mini Tutorial

3.1 Getting Started
===================

The information presented here provides a good and common basis for many
profiling tasks, but there are more features that you may want to
leverage.

   These are covered in subsequent sections in this chapter.

* Menu:

* The Example Program::                        A description of the example program used.
* A First Profile::                            How to get the first profile.
* The Source Code View::                       Display the metrics in the source code.
* The Disassembly View::                       Display the metrics at the instruction level.
* Display and Define the Metrics::             An example of how to customize the metrics.
* A First Customization of the Output::        An example of how to customize the output.
* Name the Experiment Directory::              Change the name of the experiment directory.
* Control the Number of Lines in the Output::  Change the number of lines in the tables.
* Sorting the Performance Data::               How to set the metric to sort by.
* Scripting::                                  Use a script to execute the commands.
* A More Elaborate Example::                   An example of customization.
* The Call Tree::                              Display the dynamic call tree.
* More Information on the Experiment::         How to get additional statistics.
* Control the Sampling Frequency::             How to control the sampling granularity.
* Information on Load Objects::                How to get more information on load objects.


File: gprofng.info, Node: The Example Program, Next: A First Profile, Up: Getting Started

3.1.1 The Example Program
-------------------------

Throughout this guide we use the same example C code that implements the
multiplication of a vector of length n by an m by n matrix.  The result
is stored in a vector of length m.  The algorithm has been parallelized
using POSIX Threads, or Pthreads for short.

   The code was built using the 'gcc' compiler and the name of the
executable is 'mxv-pthreads.exe'.

   The matrix sizes can be set through the '-m' and '-n' options.  The
number of threads is set with the '-t' option.  To increase the duration
of the run, the multiplication is executed repeatedly.

   This is an example that multiplies a 3000 by 2000 matrix with a
vector of length 2000 using 2 threads:

     $ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
     mxv: error check passed - rows = 3000 columns = 2000 threads = 2
     $

   The program performs an internal check to verify the results are
correct.  The result of this check is printed, followed by the matrix
sizes and the number of threads used.


File: gprofng.info, Node: A First Profile, Next: The Source Code View, Prev: The Example Program, Up: Getting Started

3.1.2 A First Profile
---------------------

The first step is to collect the performance data.  It is important to
remember that much more information is gathered than may be shown by
default.  Often a single data collection run is sufficient to get a lot
of insight.

   The 'gprofng collect app' command is used for the data collection.
Nothing needs to be changed in the way the application is executed.
The only difference is that it is now run under control of the tool, as
shown below:

     $ gprofng collect app ./mxv-pthreads.exe -m 3000 -n 2000 -t 1

   This command produces the following output:

     Creating experiment database test.1.er (Process ID: 2416504) ...
     mxv: error check passed - rows = 3000 columns = 2000 threads = 1

   We see the message that a directory with the name 'test.1.er' has
been created.  The application then completes as usual and we have our
first experiment directory that can be analyzed.

   The tool we use for this is called 'gprofng display text'.  It takes
the name of the experiment directory as an argument.

   If invoked this way, the tool starts in the interactive _interpreter_
mode.  While in this environment, commands can be given and the tool
responds.  This is illustrated below:

     $ gprofng display text test.1.er
     Warning: History and command editing is not supported on this system.
     (gp-display-text) quit
     $

   While useful in certain cases, we prefer to use this tool in command
line mode, by specifying the commands to be issued when invoking the
tool.  The way to do this is to prepend the command with a hyphen ('-')
if used on the command line.

   For example, with the 'functions' command we request a list of the
functions that have been executed and their respective CPU times:

     $ gprofng display text -functions test.1.er
     Functions sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     2.160     2.160      mxv_core
     0.047     0.103      init_data
     0.030     0.043      erand48_r
     0.013     0.013      __drand48_iterate
     0.013     0.056      drand48
     0.008     0.010      _int_malloc
     0.001     0.001      brk
     0.001     0.002      sysmalloc
     0.        0.001      __default_morecore
     0.        0.113      __libc_start_main
     0.        0.010      allocate_data
     0.        2.160      collector_root
     0.        2.160      driver_mxv
     0.        0.113      main
     0.        0.010      malloc
     0.        0.001      sbrk

   As easy and simple as these steps are, we do have a first profile of
our program!  There are three columns.  The first two contain the _Total
CPU Time_, which is the sum of the user and system time.  *Note
Inclusive and Exclusive Metrics:: for an explanation of "exclusive" and
"inclusive" times.

   The first line echoes the metric that is used to sort the output.  By
default, this is the exclusive CPU time, but the sort metric can be
changed by the user.

   We then see three columns with the exclusive and inclusive CPU times,
plus the name of the function.

   The function with the name '<Total>' is not a user function, but is
introduced by 'gprofng' and is used to display the accumulated metric
values.  In this case, we see that the total CPU time of this job was
'2.272' seconds.

   With '2.160' seconds, function 'mxv_core' is the most time consuming
function.  It is also a leaf function.

   The next function in the list is 'init_data'.  Although the CPU time
spent in this part is negligible, this is an interesting entry because
the inclusive CPU time of '0.103' seconds is higher than the exclusive
CPU time of '0.047' seconds.  Clearly it is calling another function,
or even more than one function.  *Note The Call Tree:: for the details
on how to get more information on this.

   The function 'collector_root' does not look familiar.  It is one of
the internal functions used by 'gprofng collect app' and can be ignored.
While the inclusive time is high, the exclusive time is zero.  This
means that it does not contribute to the performance.

   The question is how we can find out where this function originates
from.  There is a very useful command to get more details on a
function.  *Note Information on Load Objects::.


File: gprofng.info, Node: The Source Code View, Next: The Disassembly View, Prev: A First Profile, Up: Getting Started

3.1.3 The Source Code View
--------------------------

In general, you would like to focus the tuning efforts on the most time
consuming part(s) of the program.  In this case that is easy, since
2.160 seconds on a total of 2.272 seconds is spent in function
'mxv_core'.  That is 95% of the total and it is time to dig deeper and
look at the time distribution at the source code level.

   The 'source' command is used to accomplish this.  It takes the name
of the function, not the source filename, as an argument.  This is
demonstrated below, where the 'gprofng display text' command is used to
show the annotated source listing of function 'mxv_core'.

   Please note that the source code has to be compiled with the '-g'
option in order for the source code feature to work.  Otherwise the
location cannot be determined.

     $ gprofng display text -source mxv_core test.1.er

   The slightly modified output is as follows:

     Source file:   <apath>/mxv.c
     Object file:   mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object:   mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

                              <lines deleted>
                                 <Function: mxv_core>
        0.        0.         32. void __attribute__ ((noinline))
                                 mxv_core (
                                     uint64_t row_index_start,
                                     uint64_t row_index_end,
                                     uint64_t m, uint64_t n,
                                     double **restrict A,
                                     double *restrict b,
                                     double *restrict c)
        0.        0.         33. {
        0.        0.         34.   for (uint64_t i=row_index_start;
                                        i<=row_index_end; i++) {
        0.        0.         35.     double row_sum = 0.0;
     ## 1.687     1.687      36.     for (int64_t j=0; j<n; j++)
        0.473     0.473      37.       row_sum += A[i][j]*b[j];
        0.        0.         38.     c[i] = row_sum;
                             39.   }
        0.        0.         40. }

   The first three lines provide information on the location of the
source file, the object file and the load object (*Note Load Objects and
Functions::).

   Function 'mxv_core' is part of a source file that has other functions
as well.  These functions will be shown, but without timing information.
They have been removed in the output shown above.

   This is followed by the annotated source code listing.  The selected
metrics are shown first, followed by a source line number, and the
source code.  The most time consuming line(s) are marked with the '##'
symbol.  In this way they are easier to find.

   What we see is that all of the time is spent in lines 36-37.

   A related command comes in handy as well.  It is called 'lines' and
displays a list of the source lines and their metrics, ordered
according to the current sort metric (*Note Sorting the Performance
Data::).

   Below are the command and the output.  For layout reasons, only the
top 10 lines are shown here and the last part of the text on some lines
has been replaced by dots.

     $ gprofng display text -lines test.1.er

     Lines sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     1.687     1.687      mxv_core, line 36 in "mxv.c"
     0.473     0.473      mxv_core, line 37 in "mxv.c"
     0.032     0.088      init_data, line 72 in "manage_data.c"
     0.030     0.043      <Function: erand48_r, instructions without line numbers>
     0.013     0.013      <Function: __drand48_iterate, instructions without ...>
     0.013     0.056      <Function: drand48, instructions without line numbers>
     0.012     0.012      init_data, line 77 in "manage_data.c"
     0.008     0.010      <Function: _int_malloc, instructions without ...>
     0.003     0.003      init_data, line 71 in "manage_data.c"

   What this overview immediately highlights is that the next most time
consuming source line takes 0.032 seconds only.
With an inclusive time of 0.088 seconds, it is also clear that this
branch of the code does not impact the performance.


File: gprofng.info, Node: The Disassembly View, Next: Display and Define the Metrics, Prev: The Source Code View, Up: Getting Started

3.1.4 The Disassembly View
--------------------------

The source view is very useful to obtain more insight into where the
time is spent, but sometimes this is not sufficient.  This is when the
disassembly view comes in.  It is activated with the 'disasm' command
and, as with the source view, it displays an annotated listing.  In this
case it shows the instructions with the metrics, interleaved with the
source lines.  The instructions have a reference in square brackets ('['
and ']') to the source line they correspond to.

This is what we get for our example:

     $ gprofng display text -disasm mxv_core test.1.er

     Source file:   <apath>/mxv.c
     Object file:   mxv-pthreads.exe (found as test.1.er/archives/...)
     Load Object:   mxv-pthreads.exe (found as test.1.er/archives/...)

        Excl.     Incl.
        Total     Total
        CPU sec.  CPU sec.

                              <lines deleted>
                             32. void __attribute__ ((noinline))
                                 mxv_core (
                                     uint64_t row_index_start,
                                     uint64_t row_index_end,
                                     uint64_t m, uint64_t n,
                                     double **restrict A,
                                     double *restrict b,
                                     double *restrict c)
                             33. {
                                 <Function: mxv_core>
        0.        0.            [33]  4021ba:  mov    0x8(%rsp),%r10
                             34.   for (uint64_t i=row_index_start;
                                        i<=row_index_end; i++) {
        0.        0.            [34]  4021bf:  cmp    %rsi,%rdi
        0.        0.            [34]  4021c2:  jbe    0x37
        0.        0.            [34]  4021c4:  ret
                             35.     double row_sum = 0.0;
                             36.     for (int64_t j=0; j<n; j++)
                             37.       row_sum += A[i][j]*b[j];
        0.        0.            [37]  4021c5:  mov    (%r8,%rdi,8),%rdx
        0.        0.            [36]  4021c9:  mov    $0x0,%eax
        0.        0.            [35]  4021ce:  pxor   %xmm1,%xmm1
        0.002     0.002         [37]  4021d2:  movsd  (%rdx,%rax,8),%xmm0
        0.096     0.096         [37]  4021d7:  mulsd  (%r9,%rax,8),%xmm0
        0.375     0.375         [37]  4021dd:  addsd  %xmm0,%xmm1
     ## 1.683     1.683         [36]  4021e1:  add    $0x1,%rax
        0.004     0.004         [36]  4021e5:  cmp    %rax,%rcx
        0.        0.            [36]  4021e8:  jne    0xffffffffffffffea
                             38.     c[i] = row_sum;
        0.        0.            [38]  4021ea:  movsd  %xmm1,(%r10,%rdi,8)
        0.        0.            [34]  4021f0:  add    $0x1,%rdi
        0.        0.            [34]  4021f4:  cmp    %rdi,%rsi
        0.        0.            [34]  4021f7:  jb     0xd
        0.        0.            [35]  4021f9:  pxor   %xmm1,%xmm1
        0.        0.            [36]  4021fd:  test   %rcx,%rcx
        0.        0.            [36]  402200:  jne    0xffffffffffffffc5
        0.        0.            [36]  402202:  jmp    0xffffffffffffffe8
                             39.   }
                             40. }
        0.        0.            [40]  402204:  ret

   For each instruction, the timing values are given and we can see
exactly which ones are the most expensive.  As with the source level
view, the most expensive instructions are marked with the '##' symbol.

   As illustrated below and similar to the 'lines' command, we can get
an overview of the instructions executed by using the 'pcs' command.

Below are the command and the output, which again has been restricted
to 10 lines:

     $ gprofng display text -pcs test.1.er

     PCs sorted by metric: Exclusive Total CPU Time

     Excl.     Incl.      Name
     Total     Total
     CPU sec.  CPU sec.
     2.272     2.272      <Total>
     1.683     1.683      mxv_core + 0x00000027, line 36 in "mxv.c"
     0.375     0.375      mxv_core + 0x00000023, line 37 in "mxv.c"
     0.096     0.096      mxv_core + 0x0000001D, line 37 in "mxv.c"
     0.027     0.027      init_data + 0x000000BD, line 72 in "manage_data.c"
     0.012     0.012      init_data + 0x00000117, line 77 in "manage_data.c"
     0.008     0.008      _int_malloc + 0x00000A45
     0.007     0.007      erand48_r + 0x00000062
     0.006     0.006      drand48 + 0x00000000
     0.005     0.005      __drand48_iterate + 0x00000005


File: gprofng.info, Node: Display and Define the Metrics, Next: A First Customization of the Output, Prev: The Disassembly View, Up: Getting Started

3.1.5 Display and Define the Metrics
------------------------------------

The default metrics shown by 'gprofng display text' are useful, but
there is more recorded than displayed.  We can customize the values
shown by defining the metrics ourselves.

   There are two commands related to changing the metrics shown:
'metric_list' and 'metrics'.

   The first command shows the metrics in use, plus all the metrics that
have been stored as part of the experiment.  The second command may be
used to define the metric list.

   In our example we get the following values for the metrics:

     $ gprofng display text -metric_list test.1.er

     Current metrics: e.totalcpu:i.totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
     Available metrics:
        Exclusive Total CPU Time: e.%totalcpu
        Inclusive Total CPU Time: i.%totalcpu
                            Size: size
                      PC Address: address
                            Name: name

   This shows the metrics currently in use, the metric that is used to
sort the data, and all the metrics that have been recorded, but are not
necessarily shown.

   In this case, the default metrics are set to the exclusive and
inclusive total CPU times, plus the name of the function, or load
object.
   The 'metrics' command is used to define the metrics that need to be
displayed.

   For example, to display the exclusive total CPU time, both as a
number and a percentage, use the following metric definition:
'e.%totalcpu'.

   Since the metrics can be tailored for different views, there is a way
to reset them to the default.  This is done through the special keyword
'default'.


File: gprofng.info, Node: A First Customization of the Output, Next: Name the Experiment Directory, Prev: Display and Define the Metrics, Up: Getting Started

3.1.6 A First Customization of the Output
-----------------------------------------

With the information just given, we can customize the function overview.
For the sake of the example, we would like to display the name of the
function first, followed by the exclusive CPU time, given as an absolute
number and a percentage.

   Note that the commands are parsed in order of appearance.  This is
why we need to define the metrics _before_ requesting the function
overview:

     $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er

     Current metrics: name:e.%totalcpu
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Functions sorted by metric: Exclusive Total CPU Time

     Name                 Excl. Total
                          CPU
                           sec.      %
     <Total>              2.272 100.00
     mxv_core             2.160  95.04
     init_data            0.047   2.06
     erand48_r            0.030   1.32
     __drand48_iterate    0.013   0.57
     drand48              0.013   0.57
     _int_malloc          0.008   0.35
     brk                  0.001   0.04
     sysmalloc            0.001   0.04
     __default_morecore   0.      0.
     __libc_start_main    0.      0.
     allocate_data        0.      0.
     collector_root       0.      0.
     driver_mxv           0.      0.
     main                 0.      0.
     malloc               0.      0.
     sbrk                 0.      0.

   This was a first and simple example of how to customize the output.
Note that we did not rerun our profiling job and merely modified the
display settings.
Below we will show other, more advanced examples of customization.


File: gprofng.info, Node: Name the Experiment Directory, Next: Control the Number of Lines in the Output, Prev: A First Customization of the Output, Up: Getting Started

3.1.7 Name the Experiment Directory
-----------------------------------

When using 'gprofng collect app', the default names for experiments work
fine, but they are quite generic.  It is often more convenient to select
a more descriptive name.  For example, one that reflects the conditions
of the experiment conducted.

   For this, the mutually exclusive '-o' and '-O' options come in handy.
Both may be used to provide a name for the experiment directory, but the
behaviour of 'gprofng collect app' is different.

   With the '-o' option, an existing experiment directory is not
overwritten.  You either need to explicitly remove an existing directory
first, or use a name that is not yet in use.

   This is in contrast with the behaviour for the '-O' option.  Any
existing (experiment) directory with the same name is silently
overwritten.

   Be aware that the name of the experiment directory has to end with
'.er'.


File: gprofng.info, Node: Control the Number of Lines in the Output, Next: Sorting the Performance Data, Prev: Name the Experiment Directory, Up: Getting Started

3.1.8 Control the Number of Lines in the Output
-----------------------------------------------

The 'limit <n>' command can be used to control the number of lines
printed in various overviews, including the function view, but it also
takes effect for other display commands, like 'lines'.

   The argument '<n>' should be a positive integer number.  It sets the
number of lines in the function view.  A value of zero resets the limit
to the default.

   Be aware that the pseudo-function '<Total>' counts as a regular
function.
For example, 'limit 10' displays nine user level functions.


File: gprofng.info, Node: Sorting the Performance Data, Next: Scripting, Prev: Control the Number of Lines in the Output, Up: Getting Started

3.1.9 Sorting the Performance Data
----------------------------------

The 'sort <key>' command sets the key to be used when sorting the
performance data.

   The key is a valid metric definition, but the visibility field (*Note
Metric Definitions::) in the metric definition is ignored, since this
does not affect the outcome of the sorting operation.  For example, if
we set the sort key to 'e.totalcpu', the values are sorted in descending
order with respect to the exclusive total CPU time.

   The data can be sorted in reverse order by prepending the metric
definition with a minus ('-') sign.  For example, 'sort -e.totalcpu'.

   A default metric for the sort operation has been defined, and since
this is a persistent command, this default can be restored by using
'default' as the key.


File: gprofng.info, Node: Scripting, Next: A More Elaborate Example, Prev: Sorting the Performance Data, Up: Getting Started

3.1.10 Scripting
----------------

As is probably clear by now, the list with commands for 'gprofng display
text' can be very long.  This is tedious and also error prone.  Luckily,
there is an easier and more elegant way to control the behaviour of this
tool.

   Through the 'script' command, the name of a file with commands can be
passed in.  These commands are parsed and executed as if they appeared
on the command line, in the same order as encountered in the file.  The
commands in this script file can be mixed with commands on the command
line.

   The difference between the commands in the script file and those used
on the command line is that the latter require a leading dash ('-')
symbol.

   Comment lines are supported.
They need to start with the '#' symbol.


File: gprofng.info, Node: A More Elaborate Example, Next: The Call Tree, Prev: Scripting, Up: Getting Started

3.1.11 A More Elaborate Example
-------------------------------

With the information presented so far, we can customize our data
gathering and display commands.

   As an example, to reflect the name of the algorithm and the number of
threads that were used in the experiment, we select 'mxv.1.thr.er' as
the name of the experiment directory.  All we then need to do is to add
the '-O' option, followed by this name, on the command line when running
'gprofng collect app':

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1

   The commands to generate the profile are put into a file that we
simply call 'my-script':

     $ cat my-script
     # This is my first gprofng script
     # Set the metrics
     metrics i.%totalcpu:e.%totalcpu:name
     # Use the exclusive time to sort
     sort e.totalcpu
     # Limit the function list to 5 lines
     limit 5
     # Show the function list
     functions

   This script file is then specified as input to the 'gprofng display
text' command that is used to display the performance information stored
in 'mxv.1.thr.er':

     $ gprofng display text -script my-script mxv.1.thr.er

   The command above produces the following output:

     # This is my first gprofng script
     # Set the metrics
     Current metrics: i.%totalcpu:e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Use the exclusive time to sort
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the function list to 5 lines
     Print limit set to 5
     # Show the function list
     Functions sorted by metric: Exclusive Total CPU Time

     Incl. Total    Excl. Total     Name
     CPU            CPU
      sec.      %    sec.      %
     2.272 100.00   2.272 100.00    <Total>
     2.159  95.00   2.159  95.00    mxv_core
     0.102   4.48   0.054   2.37    init_data
     0.035   1.54   0.025   1.10    erand48_r
     0.048   2.11   0.013   0.57    drand48

   In the first part of the output, our comment lines in the script file
are shown.  These are interleaved with an acknowledgement message for
the commands.

   This is followed by a profile consisting of 5 lines only.  For both
metrics, the percentages plus the timings are given.  The numbers are
sorted with respect to the exclusive total CPU time.

   It is now immediately clear that function 'mxv_core' is responsible
for 95% of the CPU time and 'init_data' takes only 4.5%.

   This is also where we see sampling in action.  Although this is
exactly the same job we profiled before, the timings are somewhat
different, but the differences are very small.


File: gprofng.info, Node: The Call Tree, Next: More Information on the Experiment, Prev: A More Elaborate Example, Up: Getting Started

3.1.12 The Call Tree
--------------------

The call tree shows the dynamic hierarchy of the application by
displaying the functions executed and their parents.  It helps to find
the most expensive execution path in the program.

   This feature is enabled through the 'calltree' command.  This is how
to get this tree for our current experiment:

     $ gprofng display text -calltree mxv.1.thr.er

   This displays the following structure:

     Functions Call Tree. Metric: Attributed Total CPU Time

     Attr.      Name
     Total
     CPU sec.
     2.272      +-<Total>
     2.159        +-collector_root
     2.159        |  +-driver_mxv
     2.159        |    +-mxv_core
     0.114        +-__libc_start_main
     0.114          +-main
     0.102            +-init_data
     0.048            |  +-drand48
     0.035            |    +-erand48_r
     0.010            |      +-__drand48_iterate
     0.011            +-allocate_data
     0.011            |  +-malloc
     0.011            |    +-_int_malloc
     0.001            |      +-sysmalloc
     0.001            +-check_results
     0.001              +-malloc
     0.001                +-_int_malloc

   At first sight this may not be what you expected, and some
explanation is in place.

   First of all, function 'collector_root' is internal to 'gprofng' and
should be hidden from the user.  This is part of a planned future
enhancement.

   Recall that the 'objects' and 'fsingle' commands are very useful to
find out more about load objects in general, but also to help identify
an unknown entry in the function overview.  *Note Load Objects and
Functions::.

   Another thing to note is that there are two main branches: the one
under 'collector_root' and the second one under '__libc_start_main'.
This reflects the fact that we are executing a parallel program.  Even
though we only used one thread for this run, it is still executed in a
separate path.

   The main, sequential part of the program is displayed under 'main'
and shows the functions called and the time they took.

   There are two things worth noting for the call tree feature:

   * This is a dynamic tree and since sampling is used, it most likely
     looks slightly different across seemingly identical profile runs.
     In case the run times are short, it is worth considering using a
     higher sampling resolution through the '-p' option, for example
     '-p hi', to increase the sampling rate.

   * In case hardware event counters have been enabled (*Note Profile
     Hardware Event Counters::), these values are also displayed in the
     call tree view.
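   As a sketch, the higher sampling rate mentioned above can be
combined with the call tree view as follows.  The experiment name
'mxv.hi.er' is just an illustration:

```shell
# Collect with a 1 ms per thread sampling interval instead of
# the 10 ms default
$ gprofng collect app -p hi -O mxv.hi.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1
# Display the call tree for the new experiment
$ gprofng display text -calltree mxv.hi.er
```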

File: gprofng.info, Node: More Information on the Experiment, Next: Control the Sampling Frequency, Prev: The Call Tree, Up: Getting Started

3.1.13 More Information on the Experiment
-----------------------------------------

The experiment directory not only contains performance related data.
Several system characteristics, the actual command executed, and some
global performance statistics can be displayed.

   The 'header' command displays information about the experiment(s).
For example, this is the command to extract this data from our
experiment directory:

     $ gprofng display text -header mxv.1.thr.er

   The above command prints the following information.  Note that some
of the lay-out and the information has been modified.  The textual
changes are marked with the '<' and '>' symbols.

     Experiment: mxv.1.thr.er
       No errors
       No warnings
       Archive command `gp-archive -n -a on
                        --outfile <exp_dir>/archive.log <exp_dir>'

       Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
       Process pid 30591, ppid 30589, pgrp 30551, sid 30468
       Current working directory: <cwd>
       Collector version: `2.36.50'; experiment version 12.4 (64-bit)
       Host `<hostname>', OS `Linux <version>', page size 4096,
       architecture `x86_64'
         16 CPUs, clock speed 1995 MHz.
         Memory: 30871514 pages @ 4096 = 120591 MB.
       Data collection parameters:
         Clock-profiling, interval = 997 microsecs.
         Periodic sampling, 1 secs.
         Follow descendant processes from: fork|exec|combo

       Experiment started <date and time>

       Experiment Ended: 2.293162658
       Data Collection Duration: 2.293162658

   The output above may assist in troubleshooting, or in verifying some
of the operational conditions, and we recommend including this command
when generating a profile.
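   Since the commands are parsed in the order given, the 'header'
command can simply be combined with other display commands in a single
invocation; a minimal sketch:

```shell
# Print the experiment header, followed by a 10-line function
# overview, in one invocation
$ gprofng display text -header -limit 10 -functions mxv.1.thr.er
```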

   Related to this command, there is a useful option to record your own
comment(s) in an experiment.  To this end, use the '-C' option on the
'gprofng collect app' tool to specify a comment string.  Up to ten
comment lines can be included.  These comments are displayed with the
'header' command of the 'gprofng display text' tool.

   The 'overview' command displays information on the experiment(s) and
also shows a summary of the values for the metric(s) used.  This is an
example of how to use it on our newly created experiment directory:

     $ gprofng display text -overview mxv.1.thr.er

     Experiment(s):

     Experiment  : mxv.1.thr.er
       Target    : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
       Host      : <hostname> (<ISA>, Linux <version>)
       Start Time: <date and time>
       Duration  : 2.293 Seconds

     Metrics:

       Experiment Duration (Seconds): [2.293]
       Clock Profiling
         [X]Total CPU Time - totalcpu (Seconds): [*2.272]

     Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
            metrics.
        The metrics command can be used to change selections.  The
        metric_list command lists all available metrics.

   This command provides a dashboard overview that helps to easily
identify where the time is spent and, in case hardware event counters
are used, it shows their total values.


File: gprofng.info, Node: Control the Sampling Frequency, Next: Information on Load Objects, Prev: More Information on the Experiment, Up: Getting Started

3.1.14 Control the Sampling Frequency
-------------------------------------

So far, we did not talk about the frequency of the sampling process, but
in some cases it is useful to change the default of 10 milliseconds.

   The advantage of increasing the sampling frequency is that functions
that do not take much time per invocation are more accurately captured.
The downside is that more data is gathered.  This has an impact on the
overhead of the collection process, and more disk space is required.

   In general this is not an immediate concern, but with heavily
threaded applications that run for an extended period of time,
increasing the frequency may have a more noticeable impact.

   The '-p' option on the 'gprofng collect app' tool is used to enable
or disable clock based profiling, or to explicitly set the sampling
rate.  This option takes one of the following keywords:

'off'
     Disable clock based profiling.

'on'
     Enable clock based profiling with a per thread sampling interval of
     10 ms.  This is the default.

'lo'
     Enable clock based profiling with a per thread sampling interval of
     100 ms.

'hi'
     Enable clock based profiling with a per thread sampling interval of
     1 ms.

'<value>'
     Enable clock based profiling with a per thread sampling interval of
     <value>.

   One may wonder why there is an option to disable clock based
profiling.  This is because, by default, it is enabled when conducting
hardware event counter experiments (*Note Profile Hardware Event
Counters::).  With the '-p off' option, this can be disabled.

   If an explicit value is set for the sampling interval, the number can
be an integer or a floating-point number.  A suffix of 'u' for
microseconds, or 'm' for milliseconds is supported.  If no suffix is
used, the value is assumed to be in milliseconds.

   If the value is smaller than the clock profiling minimum, a warning
message is issued and the interval is set to the minimum.  In case it is
not a multiple of the clock profiling resolution, it is silently rounded
down to the nearest multiple of the clock resolution.

   If the value exceeds the clock profiling maximum, is negative, or
zero, an error is reported.
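   As an illustration of an explicit interval, the two hypothetical
invocations below both request a 0.5 ms per thread sampling interval,
first in milliseconds and then in microseconds.  We assume here that
this value is within the supported range on the target system, and the
experiment names are just examples:

```shell
# 0.5 ms sampling interval, specified in milliseconds
$ gprofng collect app -p 0.5m -O mxv.fast1.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1
# The same interval, specified in microseconds
$ gprofng collect app -p 500u -O mxv.fast2.er ./mxv-pthreads.exe -m 3000 -n 2000 -t 1
```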

   Note that the 'header' command echoes the sampling rate used.


File: gprofng.info, Node: Information on Load Objects, Prev: Control the Sampling Frequency, Up: Getting Started

3.1.15 Information on Load Objects
----------------------------------

It may happen that the function list contains a function that is not
known to the user.  This can easily happen with library functions, for
example.  Luckily, there are three commands that come in handy then.

   These commands are 'objects', 'fsingle', and 'fsummary'.  They
provide details on load objects (*Note Load Objects and Functions::).

   The 'objects' command lists all load objects that have been
referenced during the performance experiment.  Below we show the command
and the result for our profile job.  Like before, the (long) path names
in the output have been shortened and replaced by the '<apath>' symbol
that represents an absolute directory path.

     $ gprofng display text -objects mxv.1.thr.er

   The output includes the name and path of the target executable:

     <Unknown> (<Unknown>)
     <mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
     <librt-2.17.so> (/usr/lib64/librt-2.17.so)
     <libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
     <libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
     <libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
     <libc-2.17.so> (/usr/lib64/libc-2.17.so)
     <libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
     <libm-2.17.so> (/usr/lib64/libm-2.17.so)
     <libgp-collector.so> (<apath>/libgp-collector.so)
     <ld-2.17.so> (/usr/lib64/ld-2.17.so)
     <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)

   The 'fsingle' command may be used to get more details on a specific
entry in the function view.  For example, the command below provides
additional information on the 'collector_root' function shown in the
function overview.

     $ gprofng display text -fsingle collector_root mxv.1.thr.er

   Below is the output from this command.  It has been somewhat modified
to match the display requirements.

     collector_root
             Exclusive Total CPU Time:    0.    (  0. %)
             Inclusive Total CPU Time:    2.159 ( 95.0%)
                                 Size:  401
                           PC Address: 10:0x0001db60
                          Source File: <apath>/dispatcher.c
                          Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
                          Load Object: <apath>/libgp-collector.so
                         Mangled Name:
                              Aliases:

   In this table, we not only see how much time was spent in this
function, we also see where it originates from.  In addition to this,
the size and start address are given as well.  If the source code
location is known, it is also shown here.

   The related 'fsummary' command displays the same information as
'fsingle', but for all functions in the function overview, including
'<Total>':

     $ gprofng display text -fsummary mxv.1.thr.er

     Functions sorted by metric: Exclusive Total CPU Time

     <Total>
             Exclusive Total CPU Time:    2.272 (100.0%)
             Inclusive Total CPU Time:    2.272 (100.0%)
                                 Size:    0
                           PC Address: 1:0x00000000
                          Source File: (unknown)
                          Object File: (unknown)
                          Load Object: <Total>
                         Mangled Name:
                              Aliases:

     mxv_core
             Exclusive Total CPU Time:    2.159 ( 95.0%)
             Inclusive Total CPU Time:    2.159 ( 95.0%)
                                 Size:   75
                           PC Address: 2:0x000021ba
                          Source File: <apath>/mxv.c
                          Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
                          Load Object: <apath>/mxv-pthreads.exe
                         Mangled Name:
                              Aliases:

     ... etc ...


File: gprofng.info, Node: Support for Multithreading, Next: Viewing Multiple Experiments, Prev: Getting Started, Up: A Mini Tutorial

3.2 Support for Multithreading
==============================

In this chapter we introduce and discuss the support for multithreading.
As is shown below, nothing needs to be changed when collecting the
performance data.

   The difference is that additional commands are available to get more
information on the parallel environment, and several filters allow the
user to zoom in on specific threads.

* Menu:

* Creating a Multithreading Experiment::
* Commands Specific to Multithreading::


File: gprofng.info, Node: Creating a Multithreading Experiment, Next: Commands Specific to Multithreading, Up: Support for Multithreading

3.2.1 Creating a Multithreading Experiment
------------------------------------------

We demonstrate the support for multithreading using the same code and
settings as before, but this time we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2

   First of all, note that we did not change anything, other than
setting the number of threads to 2.  Nothing special is needed to
profile a multithreaded job when using 'gprofng'.

   The same is true when displaying the performance results.  The same
commands that we used before work unmodified.  For example, this is all
that is needed to get a function overview:

     $ gprofng display text -limit 10 -functions mxv.2.thr.er

   This produces the following familiar looking output:

     Print limit set to 10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total   Incl. Total    Name
     CPU           CPU
      sec.          sec.
     2.268         2.268          <Total>
     2.155         2.155          mxv_core
     0.044         0.103          init_data
     0.030         0.046          erand48_r
     0.016         0.016          __drand48_iterate
     0.013         0.059          drand48
     0.008         0.011          _int_malloc
     0.003         0.003          brk
     0.            0.003          __default_morecore
     0.            0.114          __libc_start_main

File: gprofng.info, Node: Commands Specific to Multithreading, Prev: Creating a Multithreading Experiment, Up: Support for Multithreading

3.2.2 Commands Specific to Multithreading
-----------------------------------------

The function overview shown above presents the results aggregated over
all the threads.  The interesting new element is that we can also look
at the performance data for the individual threads.

   The 'thread_list' command displays how many threads have been used:

     $ gprofng display text -thread_list mxv.2.thr.er

   This produces the following output, showing that three threads have
been used:

     Exp Sel Total
     === === =====
       1 all     3

   The output confirms there is one experiment and that by default all
threads are selected.

   It may seem surprising to see three threads here, since we used the
'-t 2' option, but it is common for a Pthreads program to use one
additional thread.  This is typically the thread that runs from start to
finish and handles the sequential portions of the code, as well as takes
care of managing the other threads.

   It is no different in our example code.  At some point, the main
thread creates and activates the two threads that perform the
multiplication of the matrix with the vector.  Upon completion of this
computation, the main thread continues.

   The 'threads' command is simple, yet very powerful.  It shows the
total value of the metrics for each thread.  To make it easier to
interpret the data, we modify the metrics to include percentages:

     $ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er

   The command above produces the following overview:

     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Objects sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     2.258 100.00   <Total>
     1.075  47.59   Process 1, Thread 3
     1.070  47.37   Process 1, Thread 2
     0.114   5.03   Process 1, Thread 1

   The first line gives the total CPU time accumulated over the threads
selected.  This is followed by the metric value(s) for each thread.

   From this it is clear that the main thread is responsible for 5% of
the total CPU time, while the other two threads account for 47% each.

   This view is ideally suited to verify whether there are any load
balancing issues, and to find the most time consuming thread(s).

   While useful, often more information than this is needed.  This is
where the thread selection filter comes in.  Through the 'thread_select'
command, one or more threads may be selected (*Note The Selection List::
for how to define the selection list).

   Since it is most common to use this command in a script, we do so as
well here.  Below is the script we are using:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the function overview for thread 1
     thread_select 1
     functions
     # Get the function overview for thread 2
     thread_select 2
     functions
     # Get the function overview for thread 3
     thread_select 3
     functions

   The definition of the metrics and the output limiter has been shown
and explained before, and is not discussed again.  The new command we
focus on is 'thread_select'.

   This command takes a list (*Note The Selection List::) to select
specific threads.  In this case we simply use the individual thread
numbers that we obtained with the 'thread_list' command earlier.

   This restricts the output of the 'functions' command to the thread
number(s) specified.  This means that the script above shows which
function(s) each thread executes and how much CPU time they consumed.
Both the timings and their percentages are given.

   This is the relevant part of the output for the first thread:

     # Get the function overview for thread 1
     Exp Sel Total
     === === =====
       1 1       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.114 100.00   <Total>
     0.051  44.74   init_data
     0.028  24.56   erand48_r
     0.017  14.91   __drand48_iterate
     0.010   8.77   _int_malloc
     0.008   7.02   drand48
     0.      0.     __libc_start_main
     0.      0.     allocate_data
     0.      0.     main
     0.      0.     malloc

   As usual, the comment lines are echoed.  This is followed by a
confirmation of our selection.  We see that indeed thread 1 has been
selected.  What is displayed next is the function overview for this
particular thread.  Due to the 'limit 10' command, there are ten entries
in this list.

   Below are the overviews for threads 2 and 3 respectively.  We see
that all of the CPU time is spent in function 'mxv_core', and that this
time is approximately the same for both threads.

     # Get the function overview for thread 2
     Exp Sel Total
     === === =====
       1 2       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     1.072 100.00   <Total>
     1.072 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for thread 3
     Exp Sel Total
     === === =====
       1 3       3
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     1.076 100.00   <Total>
     1.076 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   When analyzing the performance of a multithreaded application, it is
sometimes useful to know whether threads have mostly executed on the
same core, or whether they have wandered across multiple cores.
This sort of stickiness is usually referred to as _thread affinity_.

   Similar to the commands for the threads, there are several commands
related to the usage of the cores, or _CPUs_ as they are called in
'gprofng' (*Note The Concept of a CPU in gprofng::).

   In order to have some more interesting data to look at, we created a
new experiment, this time using 8 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8

   Similar to the 'thread_list' command, the 'cpu_list' command displays
how many CPUs have been used.  The equivalent of the 'threads' command
is the 'cpus' command, which shows the CPU numbers that were used and
how much time was spent on each of them.  Both are demonstrated below.

     $ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er

   This command produces the following output:

     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     Exp Sel Total
     === === =====
       1 all    10
     Objects sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     2.310 100.00   <Total>
     0.286  12.39   CPU 7
     0.284  12.30   CPU 13
     0.282  12.21   CPU 5
     0.280  12.13   CPU 14
     0.266  11.52   CPU 9
     0.265  11.48   CPU 2
     0.264  11.44   CPU 11
     0.194   8.42   CPU 0
     0.114   4.92   CPU 1
     0.074   3.19   CPU 15

   What we see in this table is that a total of 10 CPUs have been used.
This is followed by a list with all the CPU numbers that have been used
during the run.  For each CPU, it is shown how much time was spent on
it.

   While the table with thread times shown earlier may point at a load
imbalance in the application, this overview has a different purpose.
For example, we see that 10 CPUs have been used, but we know that the
application uses 9 threads only.  This means that at least one thread
has executed on more than one CPU.  In itself this is not something to
worry about, but it warrants a deeper investigation.

   Honesty dictates that we next performed a pre-analysis to find out
which thread(s) have been running on more than one CPU.  We found this
to be thread 7.  It has executed on CPUs 0 and 15.

   With this knowledge, we wrote the script shown below.  It zooms in on
the behaviour of thread 7.

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     functions
     # Get the function overview for CPU 0
     cpu_select 0
     functions
     # Get the function overview for CPU 15
     cpu_select 15
     functions

   From the threads overview shown earlier, we know that thread 7 has
used '0.268' seconds of CPU time.

   By selecting CPUs 0 and 15, respectively, we get the following
function overviews:

     # Get the function overview for CPU 0
     Exp Sel Total
     === === =====
       1 0      10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.194 100.00   <Total>
     0.194 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

     # Get the function overview for CPU 15
     Exp Sel Total
     === === =====
       1 15     10
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     0.074 100.00   <Total>
     0.074 100.00   mxv_core
     0.      0.     collector_root
     0.      0.     driver_mxv

   This shows that thread 7 spent '0.194' seconds on CPU 0 and '0.074'
seconds on CPU 15.
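   Since script commands may be mixed with commands on the command
line, the same check can be done the other way around; a sketch,
assuming the thread filter may be combined with the 'cpus' overview:

```shell
# Restrict the data to thread 7 and show the CPUs it has used,
# plus the time spent on each of them
$ gprofng display text -thread_select 7 -cpus mxv.8.thr.er
```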

File: gprofng.info, Node: Viewing Multiple Experiments, Next: Profile Hardware Event Counters, Prev: Support for Multithreading, Up: A Mini Tutorial

3.3 Viewing Multiple Experiments
================================

One thing we have not covered so far is that 'gprofng' fully supports
the analysis of multiple experiments.  The 'gprofng display text' tool
accepts a list of experiments.  The data can either be aggregated across
the experiments, or used in a comparison.

* Menu:

* Aggregation of Experiments::
* Comparison of Experiments::


File: gprofng.info, Node: Aggregation of Experiments, Next: Comparison of Experiments, Up: Viewing Multiple Experiments

3.3.1 Aggregation of Experiments
--------------------------------

By default, the data for multiple experiments is aggregated, and the
display commands show these combined results.

   For example, we can aggregate the data for our single and dual thread
experiments.  Below is the script we used for this:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Get the list with experiments
     experiment_list
     # Get the function overview
     functions

   With the exception of the 'experiment_list' command, all commands
used have been discussed earlier.

   The 'experiment_list' command provides a list of the experiments that
have been loaded.  This is used to verify that we are looking at the
experiments we intend to aggregate.

     $ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er

   With the command above, we get the following output:

     # Define the metrics
     Current metrics: e.%totalcpu:name
     Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
     # Limit the output to 10 lines
     Print limit set to 10
     # Get the list with experiments
     ID Sel   PID   Experiment
     == === =====   ============
      1 yes 30591   mxv.1.thr.er
      2 yes 11629   mxv.2.thr.er
     # Get the function overview
     Functions sorted by metric: Exclusive Total CPU Time

     Excl. Total    Name
     CPU
      sec.      %
     4.533 100.00   <Total>
     4.306  94.99   mxv_core
     0.105   2.31   init_data
     0.053   1.17   erand48_r
     0.027   0.59   __drand48_iterate
     0.021   0.46   _int_malloc
     0.021   0.46   drand48
     0.001   0.02   sysmalloc
     0.      0.     __libc_start_main
     0.      0.     allocate_data

   The first five lines should look familiar.  The five lines that
follow echo the comment lines in the script and show the overview of the
experiments.  This confirms that two experiments have been loaded and
that both are active.

   This is followed by the function overview.  The timings have been
summed up and the percentages adjusted accordingly.  For example, the
total accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.


File: gprofng.info, Node: Comparison of Experiments, Prev: Aggregation of Experiments, Up: Viewing Multiple Experiments

3.3.2 Comparison of Experiments
-------------------------------

The support for multiple experiments really shines in comparison mode.
This feature is enabled through the command 'compare on' and is disabled
by setting 'compare off'.

   In comparison mode, the data for the various experiments is shown
side by side, as illustrated below where we compare the results for the
multithreaded experiments using one and two threads respectively:

     $ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er

This produces the following output:

     Functions sorted by metric: Exclusive Total CPU Time

     mxv.1.thr.er  mxv.2.thr.er  mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total   Incl. Total   Incl. Total   Name
     CPU           CPU           CPU           CPU
      sec.          sec.          sec.          sec.
     2.272         2.261         2.272         2.261         <Total>
     2.159         2.148         2.159         2.148         mxv_core
     0.054         0.051         0.102         0.104         init_data
     0.025         0.028         0.035         0.045         erand48_r
     0.013         0.008         0.048         0.053         drand48
     0.011         0.010         0.012         0.010         _int_malloc
     0.010         0.017         0.010         0.017         __drand48_iterate
     0.001         0.            0.001         0.            sysmalloc
     0.            0.            0.114         0.114         __libc_start_main
     0.            0.            0.011         0.010         allocate_data
     0.            0.            0.001         0.            check_results
     0.            0.            2.159         2.148         collector_root
     0.            0.            2.159         2.148         driver_mxv
     0.            0.            0.114         0.114         main
     0.            0.            0.012         0.010         malloc

   This table is already helpful to more easily compare (two) profiles,
but there is more that we can do here.

   By default, in comparison mode, all measured values are shown.
Often, profiling is about comparing performance data.  It is therefore
more useful to look at differences, or ratios, using one experiment as
a reference.

   The values shown are then relative to this reference.  For example,
if a ratio is below one, it means the reference value was higher.

   This feature is controlled through the 'compare' command.  In
addition to 'on' and 'off', this command also supports the 'delta' and
'ratio' keywords.

   Usage of one of these two keywords enables the comparison feature
and shows either the difference, or the ratio, relative to the
reference data.
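   The arithmetic behind these two modes is simple and can be sketched
in a few lines of Python.  This sketch is our own illustration and not
part of 'gprofng'; the function names are made up, and the sample
numbers are the exclusive CPU times for '<Total>' and 'mxv_core' from
the two experiments shown above.

```python
# Illustrate how "compare delta" and "compare ratio" transform a metric
# column, using the first experiment as the reference.

def compare_delta(reference, values):
    """Return the difference of each value relative to the reference."""
    return [v - r for r, v in zip(reference, values)]

def compare_ratio(reference, values):
    """Return the ratio of each value relative to the reference."""
    return [v / r if r != 0 else 0.0 for r, v in zip(reference, values)]

# Exclusive CPU times (seconds) for <Total> and mxv_core, taken from
# the one and two thread experiments shown earlier.
ref  = [2.272, 2.159]   # mxv.1.thr.er (the reference)
vals = [2.261, 2.148]   # mxv.2.thr.er

print(compare_delta(ref, vals))   # both deltas are about -0.011
print(compare_ratio(ref, vals))   # both ratios are slightly below one
```

In 'delta' mode the reference column keeps its measured values and the
other columns show such differences; in 'ratio' mode they show such
ratios.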

   In the example below, we use the same two experiments used in the
comparison above, but as before, the number of lines is restricted to 10
and we focus on the exclusive timings plus percentages.  For the
comparison part we are interested in the differences.

   This is the script that produces such an overview:

     # Define the metrics
     metrics e.%totalcpu
     # Limit the output to 10 lines
     limit 10
     # Set the comparison mode to differences
     compare delta
     # Get the function overview
     functions

   Assuming this script file is called 'my-script-comp', this is how we
get the table displayed on our screen:

     $ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er

   Leaving out some of the lines we have seen before, we get the
following table:

     mxv.1.thr.er  mxv.2.thr.er
     Excl. Total   Excl. Total    Name
     CPU           CPU
      sec.      %   delta      %
     2.272 100.00  -0.011 100.00  <Total>
     2.159  95.00  -0.011  94.97  mxv_core
     0.054   2.37  -0.003   2.25  init_data
     0.025   1.10  +0.003   1.23  erand48_r
     0.013   0.57  -0.005   0.35  drand48
     0.011   0.48  -0.001   0.44  _int_malloc
     0.010   0.44  +0.007   0.75  __drand48_iterate
     0.001   0.04  -0.001   0.    sysmalloc
     0.      0.    +0.      0.    __libc_start_main
     0.      0.    +0.      0.    allocate_data

   It is now easy to see that the CPU times for the most time consuming
functions in this code are practically the same.

   While in this case we used the delta as a comparison, the ratio
could have been used in the same way.

   Note that the comparison feature is supported at the function,
source, and disassembly level.  There is no practical limit on the
number of experiments that can be used in a comparison.


File: gprofng.info, Node: Profile Hardware Event Counters, Next: Java Profiling, Prev: Viewing Multiple Experiments, Up: A Mini Tutorial

3.4 Profile Hardware Event Counters
===================================

Many processors provide a set of hardware event counters and 'gprofng'
provides support for this feature.  *Note Hardware Event Counters
Explained:: for those readers who are not familiar with such counters
and would like to learn more.

   In this section we explain how to get the details on the event
counter support for the processor used in the experiment(s), and show
several examples.

* Menu:

* Getting Information on the Counters Supported::
* Examples Using Hardware Event Counters::


File: gprofng.info, Node: Getting Information on the Counters Supported, Next: Examples Using Hardware Event Counters, Up: Profile Hardware Event Counters

3.4.1 Getting Information on the Counters Supported
---------------------------------------------------

The first step is to check whether the processor used for the
experiments is supported by 'gprofng'.

   The '-h' option on 'gprofng collect app' will show the event counter
information:

     $ gprofng collect app -h

   In case the counters are supported, a list with the events is
printed.  Otherwise, a warning message will be issued.

   For example, below we show this command and the output on an Intel
Xeon Platinum 8167M (aka "Skylake") processor.  The output has been
split into several sections and each section is commented upon
separately.

     Run "gprofng collect app --help" for a usage message.

     Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
       (cpuver=2499):

       -h {auto|lo|on|hi}
           turn on default set of HW counters at the specified rate
       -h <ctr_def> [-h <ctr_def>]...
       -h <ctr_def>[,<ctr_def>]...
          specify HW counter profiling for up to 4 HW counters

   The first line shows how to get a usage overview.  This is followed
by some information on the target processor.

   The next five lines explain in what ways the '-h' option can be used
to define the events to be monitored.

   The first version shown above enables a default set of counters.
This default depends on the processor this command is executed on.  The
keyword following the '-h' option defines the sampling rate:

'auto'
     Match the sample rate used by clock profiling.  If the latter is
     disabled, use a per thread sampling rate of approximately 100
     samples per second.  This setting is the default and preferred.

'on'
     Use a per thread sampling rate of approximately 100 samples per
     second.

'lo'
     Use a per thread sampling rate of approximately 10 samples per
     second.

'hi'
     Use a per thread sampling rate of approximately 1000 samples per
     second.

   The second and third variant define the events to be monitored.
Note that the number of simultaneous events supported is printed.  In
this case we can monitor four events in a single profiling job.

   It is a matter of preference whether you like to use the '-h' option
for each event, or use it once, followed by a comma separated list.

   There is one slight catch though.  The counter definition below has
a mandatory comma (',') between the event and the rate.  While a
default can be used for the rate, the comma cannot be omitted.  This
may result in a somewhat awkward counter definition in case the default
sampling rate is used.

   For example, the following two commands are equivalent.  Note the
double comma in the second command.  This is not a typo.

     $ gprofng collect app -h cycles -h insts ...
     $ gprofng collect app -h cycles,,insts ...

   In the first command this comma is not needed, because a comma (',')
immediately followed by white space may be omitted.

   This is why we prefer this syntax, and in the remainder we will use
the first version of this command.

   The counter definition takes an event name, plus optionally one or
more attributes, followed by a comma, and optionally the sampling rate.
The output section below shows the formal definition.

     <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]

   The printed help then explains this syntax.  Below we have
summarized and expanded this output:

'<ctr>'
     The counter name must be selected from the available counters
     listed as part of the output printed with the '-h' option.  On
     most systems, if a counter is not listed, it may still be
     specified by its numeric value.

'~<attr>=<val>'
     This is an optional attribute that depends on the processor.  The
     list of supported attributes is printed in the output.  Examples
     of attributes are "user", or "system".  The value can be given in
     decimal or hexadecimal format.  Multiple attributes may be
     specified, and each must be preceded by a '~'.

'<rate>'

     The sampling rate is one of the following:

     'auto'
          This is the default and matches the rate used by clock
          profiling.  If clock profiling is disabled, use 'on'.

     'on'
          Set the per thread maximum sampling rate to ~100
          samples/second

     'lo'
          Set the per thread maximum sampling rate to ~10
          samples/second

     'hi'
          Set the per thread maximum sampling rate to ~1000
          samples/second

     '<interval>'
          Define the sampling interval.  *Note Control the Sampling
          Frequency:: how to define this.

   After the section with the formal definition of events and counters,
a processor specific list is displayed.
This part starts with an 1912overview of the default set of counters and the aliased names supported 1913_on this specific processor_. 1914 1915 Default set of HW counters: 1916 1917 -h cycles,,insts,,llm 1918 1919 Aliases for most useful HW counters: 1920 1921 alias raw name type units regs description 1922 1923 cycles unhalted-core-cycles CPU-cycles 0123 CPU Cycles 1924 insts instruction-retired events 0123 Instructions Executed 1925 llm llc-misses events 0123 Last-Level Cache Misses 1926 br_msp branch-misses-retired events 0123 Branch Mispredict 1927 br_ins branch-instruction-retired events 0123 Branch Instructions 1928 1929 The definitions given above may or may not be available on other 1930processors, but we try to maximize the overlap across alias sets. 1931 1932 The table above shows the default set of counters defined for this 1933processor, and the aliases. For each alias the full "raw" name is 1934given, plus the unit of the number returned by the counter (CPU cycles, 1935or a raw count), the hardware counter the event is allowed to be mapped 1936onto, and a short description. 
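   To make the '<ctr_def>' syntax shown earlier more concrete, the
sketch below splits a single counter definition into its event name,
attributes, and sampling rate.  This parser is only an illustration
written for this manual; it is not part of 'gprofng' and omits error
handling.

```python
def parse_ctr_def(ctr_def):
    """Split '<ctr>[~<attr>=<val>]...,[<rate>]' into its components."""
    # The comma separates the event (plus attributes) from the rate.
    head, _, rate = ctr_def.partition(",")
    # Attributes are appended to the event name, each preceded by '~'.
    event, *attrs = head.split("~")
    attributes = dict(a.split("=", 1) for a in attrs)
    # An empty rate field means the default rate is used.
    return event, attributes, (rate or "auto")

# An empty rate field is what leads to the double comma seen in
# "-h cycles,,insts": the rate of "cycles" was left at the default.
print(parse_ctr_def("cycles,"))            # ('cycles', {}, 'auto')
print(parse_ctr_def("llc-misses,hi"))      # ('llc-misses', {}, 'hi')
print(parse_ctr_def("cycles~user=1,lo"))   # ('cycles', {'user': '1'}, 'lo')
```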
1937 1938 The last part of the output contains all the events that can be 1939monitored: 1940 1941 Raw HW counters: 1942 1943 name type units regs description 1944 1945 unhalted-core-cycles CPU-cycles 0123 1946 unhalted-reference-cycles events 0123 1947 instruction-retired events 0123 1948 llc-reference events 0123 1949 llc-misses events 0123 1950 branch-instruction-retired events 0123 1951 branch-misses-retired events 0123 1952 ld_blocks.store_forward events 0123 1953 ld_blocks.no_sr events 0123 1954 ld_blocks_partial.address_alias events 0123 1955 dtlb_load_misses.miss_causes_a_walk events 0123 1956 dtlb_load_misses.walk_completed_4k events 0123 1957 1958 <many lines deleted> 1959 1960 l2_lines_out.silent events 0123 1961 l2_lines_out.non_silent events 0123 1962 l2_lines_out.useless_hwpf events 0123 1963 sq_misc.split_lock events 0123 1964 1965 See Chapter 19 of the "Intel 64 and IA-32 Architectures Software 1966 Developer's Manual Volume 3B: System Programming Guide" 1967 1968 As can be seen, these names are not always easy to correlate to a 1969specific event of interest. The processor manual should provide more 1970clarity on this. 1971 1972 1973File: gprofng.info, Node: Examples Using Hardware Event Counters, Prev: Getting Information on the Counters Supported, Up: Profile Hardware Event Counters 1974 19753.4.2 Examples Using Hardware Event Counters 1976-------------------------------------------- 1977 1978The previous section may give the impression that these counters are 1979hard to use, but as we will show now, in practice it is quite simple. 1980 1981 With the information from the '-h' option, we can easily set up our 1982first event counter experiment. 

   We start by using the default set of counters defined for our
processor and we use 2 threads:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2

   The new option here is '-h auto'.  The 'auto' keyword enables
hardware event counter profiling and selects the default set of
counters defined for this processor.

   As before, we can display the information, but there is one
practical hurdle to take.  Unless we want to view all metrics recorded,
we would need to know the names of the events that have been enabled.
This is tedious and also not portable in case we would like to repeat
this experiment on another processor.

   This is where the special 'hwc' metric comes in very handy.  It
automatically expands to the active set of events used.

   With this, it is very easy to display the event counter values.
Note that although the regular clock based profiling was enabled, we
only want to see the counter values.  We also request to see the
percentages and limit the output to the first 5 lines:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp

     Current metrics: e.%cycles:e+%insts:e+%llm:name
     Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
     Print limit set to 5
     Functions sorted by metric: Exclusive CPU Cycles

     Excl. CPU     Excl. Instructions  Excl. Last-Level    Name
     Cycles        Executed            Cache Misses
      sec.      %               %                  %
     2.691 100.00  7906475309 100.00   122658983 100.00    <Total>
     2.598  96.54  7432724378  94.01   121745696  99.26    mxv_core
     0.035   1.31   188860269   2.39       70084   0.06    erand48_r
     0.026   0.95    73623396   0.93      763116   0.62    init_data
     0.018   0.66    76824434   0.97       40040   0.03    drand48

   As we have seen before, the first few lines echo the settings.
This includes a list with the hardware event counters used by default.

   The table that follows makes it very easy to get an overview where
the time is spent and how many of the target events have occurred.

   As before, we can drill down deeper and see the same metrics at the
source line and instruction level.  Other than using 'hwc' in the
metrics definitions, nothing has changed compared to the previous
examples:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.hwc -source mxv_core $exp

   This is the relevant part of the output.  Since the lines get very
long, we have somewhat modified the lay-out:

     Excl. CPU   Excl.         Excl.
     Cycles      Instructions  Last-Level
      sec.       Executed      Cache Misses
                                            <Function: mxv_core>
     0.          0             0        32. void __attribute__ ((noinline))
                                            mxv_core(...)
     0.          0             0        33. {
     0.          0             0        34.   for (uint64_t i=...) {
     0.          0             0        35.     double row_sum = 0.0;
  ## 1.872       7291879319    88150571 36.     for (int64_t j=0; j<n; j++)
     0.725       140845059     33595125 37.       row_sum += A[i][j]*b[j];
     0.          0             0        38.     c[i] = row_sum;
                                        39.   }
     0.          0             0        40. }

   In a similar way we can display the event counter values at the
instruction level.  Again we have modified the lay-out due to page
width limitations:

     $ exp=mxv.hwc.def.2.thr.er
     $ gprofng display text -metrics e.hwc -disasm mxv_core $exp

     Excl. CPU   Excl.         Excl.
     Cycles      Instructions  Last-Level
      sec.       Executed      Cache Misses
                                            <Function: mxv_core>
     0.          0             0        [33] 4021ba:  mov    0x8(%rsp),%r10
                                        34.   for (uint64_t i=...) {
     0.          0             0        [34] 4021bf:  cmp    %rsi,%rdi
     0.          0             0        [34] 4021c2:  jbe    0x37
     0.          0             0        [34] 4021c4:  ret
                                        35.     double row_sum = 0.0;
                                        36.     for (int64_t j=0; j<n; j++)
                                        37.       row_sum += A[i][j]*b[j];
     0.          0             0        [37] 4021c5:  mov    (%r8,%rdi,8),%rdx
     0.          0             0        [36] 4021c9:  mov    $0x0,%eax
     0.          0             0        [35] 4021ce:  pxor   %xmm1,%xmm1
     0.002       12804230      321394   [37] 4021d2:  movsd  (%rdx,%rax,8),%xmm0
     0.141       60819025      3866677  [37] 4021d7:  mulsd  (%r9,%rax,8),%xmm0
     0.582       67221804      29407054 [37] 4021dd:  addsd  %xmm0,%xmm1
  ## 1.871       7279075109    87989870 [36] 4021e1:  add    $0x1,%rax
     0.002       12804210      80351    [36] 4021e5:  cmp    %rax,%rcx
     0.          0             0        [36] 4021e8:  jne    0xffffffffffffffea
                                        38.     c[i] = row_sum;
     0.          0             0        [38] 4021ea:  movsd  %xmm1,(%r10,%rdi,8)
     0.          0             0        [34] 4021f0:  add    $0x1,%rdi
     0.          0             0        [34] 4021f4:  cmp    %rdi,%rsi
     0.          0             0        [34] 4021f7:  jb     0xd
     0.          0             0        [35] 4021f9:  pxor   %xmm1,%xmm1
     0.          0             0        [36] 4021fd:  test   %rcx,%rcx
     0.          0             80350    [36] 402200:  jne    0xffffffffffffffc5
     0.          0             0        [36] 402202:  jmp    0xffffffffffffffe8
                                        39.   }
                                        40. }
     0.          0             0        [40] 402204:  ret

   So far we have used the default settings for the event counters.  It
is quite straightforward to select specific counters.  For the sake of
the example, let's assume we would like to count how many branch
instructions have been executed, and how many retired memory load
instructions missed in the L1 cache.  We also want to count these
events with a high resolution.

   This is the command to do so:

     $ exe=mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ exp=mxv.hwc.sel.2.thr.er
     $ hwc1=br_ins,hi
     $ hwc2=mem_load_retired.l1_miss,hi
     $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2

   As before, we get a table with the event counts.  Due to the very
long name for the second counter, we have somewhat modified the output.

     $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er

     Functions sorted by metric: Exclusive Total CPU Time
     Excl.     Incl.     Excl. Branch  Excl.                     Name
     Total     Total     Instructions  mem_load_retired.l1_miss
     CPU sec.  CPU sec.                Events
     2.597     2.597     1305305319    4021340                   <Total>
     2.481     2.481     1233233242    3982327                   mxv_core
     0.040     0.107       19019012       9003                   init_data
     0.028     0.052       23023048      15006                   erand48_r
     0.024     0.024       19019008       9004                   __drand48_iterate
     0.015     0.067       11011009       2998                   drand48
     0.008     0.010              0       3002                   _int_malloc
     0.001     0.001              0          0                   brk
     0.001     0.002              0          0                   sysmalloc
     0.        0.001              0          0                   __default_morecore

   When using event counters, the values could be very large and it is
not easy to compare the numbers.  As we will show next, the 'ratio'
feature is very useful when comparing such profiles.

   To demonstrate this, we have set up another event counter experiment
where we would like to compare the number of last-level cache misses
and the number of branch instructions executed when using a single
thread, or two threads.

   These are the commands used to generate the experiment directories:

     $ exe=./mxv-pthreads.exe
     $ m=3000
     $ n=2000
     $ exp1=mxv.hwc.comp.1.thr.er
     $ exp2=mxv.hwc.comp.2.thr.er
     $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
     $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2

   The following script has been used to get the tables.  Due to
lay-out restrictions, we have to create two tables, one for each
counter.

     # Limit the output to 5 lines
     limit 5
     # Define the metrics
     metrics name:e.llm
     # Set the comparison to ratio
     compare ratio
     functions
     # Define the metrics
     metrics name:e.br_ins
     # Set the comparison to ratio
     compare ratio
     functions

   Note that we print the name of the function first, followed by the
counter data.  The new element is that we set the comparison mode to
'ratio'.  This divides the data in a column by its counterpart in the
reference experiment.

   This is the command using this script and the two experiment
directories as input:

     $ gprofng display text -script my-script-comp-counters \
       mxv.hwc.comp.1.thr.er \
       mxv.hwc.comp.2.thr.er

   By design, we get two tables, one for each counter:

     Functions sorted by metric: Exclusive Last-Level Cache Misses

                        mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
     Name               Excl. Last-Level       Excl. Last-Level
                        Cache Misses           Cache Misses
                                                 ratio
     <Total>            122709276              x   0.788
     mxv_core           121796001              x   0.787
     init_data             723064              x   1.055
     erand48_r             100111              x   0.500
     drand48                60065              x   1.167

     Functions sorted by metric: Exclusive Branch Instructions

                        mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
     Name               Excl. Branch           Excl. Branch
                        Instructions           Instructions
                                                 ratio
     <Total>            1307307316             x   0.997
     mxv_core           1235235239             x   0.997
     erand48_r            23023033             x   0.957
     drand48              20020009             x   0.600
     __drand48_iterate    17017028             x   0.882

   A ratio less than one in the second column means that this counter
value was smaller than the value from the reference experiment shown in
the first column.

   This kind of presentation of the results makes it much easier to
quickly interpret the data.

   We conclude this section with thread-level event counter overviews,
but before we go into this, there is an important metric we need to
mention.

   In case it is known how many instructions and CPU cycles have been
executed, the value for the IPC ("Instructions Per Clock Cycle") can be
computed.  *Note Hardware Event Counters Explained::.  This is a
derived metric that gives an indication of how well the processor is
utilized.  The inverse of the IPC is called CPI.

   The 'gprofng display text' command automatically computes the IPC
and CPI values if an experiment contains the event counter values for
the instructions and CPU cycles executed.
These are part of the metric list and can be displayed, just like any
other metric.

   This can be verified through the 'metric_list' command.  If we go
back to our earlier experiment with the default event counters, we get
the following result.

     $ gprofng display text -metric_list mxv.hwc.def.2.thr.er

     Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
     Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
     Available metrics:
        Exclusive Total CPU Time: e.%totalcpu
        Inclusive Total CPU Time: i.%totalcpu
        Exclusive CPU Cycles: e.+%cycles
        Inclusive CPU Cycles: i.+%cycles
        Exclusive Instructions Executed: e+%insts
        Inclusive Instructions Executed: i+%insts
        Exclusive Last-Level Cache Misses: e+%llm
        Inclusive Last-Level Cache Misses: i+%llm
        Exclusive Instructions Per Cycle: e+IPC
        Inclusive Instructions Per Cycle: i+IPC
        Exclusive Cycles Per Instruction: e+CPI
        Inclusive Cycles Per Instruction: i+CPI
        Size: size
        PC Address: address
        Name: name

   Among the other metrics, we see the new metrics for the IPC and CPI
listed.

   In the script below, we use this information and add the IPC and CPI
to the metrics to be displayed.  We also use the thread filter to
display these values for the individual threads.

   This is the complete script we have used.  Other than a different
selection of the metrics, there are no new features.

     # Define the metrics
     metrics e.insts:e.%cycles:e.IPC:e.CPI
     # Sort with respect to cycles
     sort e.cycles
     # Limit the output to 5 lines
     limit 5
     # Get the function overview for all threads
     functions
     # Get the function overview for thread 1
     thread_select 1
     functions
     # Get the function overview for thread 2
     thread_select 2
     functions
     # Get the function overview for thread 3
     thread_select 3
     functions

   In the metrics definition on the second line, we explicitly request
the counter values for the instructions ('e.insts') and CPU cycles
('e.cycles') executed.  These names can be found in the output from the
'metric_list' command above.  In addition to these metrics, we also
request the IPC and CPI to be shown.

   As before, we used the 'limit' command to control the number of
functions displayed.  We then request an overview for all the threads,
followed by three sets of two commands to select a thread and display
the function overview.

   The script above is used as follows:

     $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er

   This script produces four tables.  We list them separately below,
and have left out the additional output.

   The first table shows the accumulated values across the three
threads that have been active.

     Functions sorted by metric: Exclusive CPU Cycles

     Excl.         Excl. CPU     Excl.  Excl.  Name
     Instructions  Cycles         IPC    CPI
     Executed       sec.      %
     7906475309    2.691 100.00  1.473  0.679  <Total>
     7432724378    2.598  96.54  1.434  0.697  mxv_core
      188860269    0.035   1.31  2.682  0.373  erand48_r
       73623396    0.026   0.95  1.438  0.696  init_data
       76824434    0.018   0.66  2.182  0.458  drand48

   This shows that the IPC of this program is completely dominated by
function 'mxv_core'.  It has a fairly low IPC value of 1.43.
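   The derived IPC and CPI values in these tables follow directly from
the two counter values.  The sketch below shows the computation; it is
our own illustration, and the cycle count is a made-up number, chosen
only to reproduce an IPC close to the value reported for 'mxv_core'
above.

```python
def ipc(instructions, cycles):
    """Instructions Per Cycle: instructions retired per CPU cycle."""
    return instructions / cycles

def cpi(instructions, cycles):
    """Cycles Per Instruction: the inverse of the IPC."""
    return cycles / instructions

# The instruction count is taken from the mxv_core row above; the
# cycle count is hypothetical (the tables report cycles in seconds).
insts  = 7_432_724_378
cycles = 5_183_000_000

print(f"IPC = {ipc(insts, cycles):.3f}")   # IPC = 1.434
print(f"CPI = {cpi(insts, cycles):.3f}")   # CPI = 0.697
```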
2320 2321 The next table is for thread 1 and shows the values for the main 2322thread. 2323 2324 Exp Sel Total 2325 === === ===== 2326 1 1 3 2327 Functions sorted by metric: Exclusive CPU Cycles 2328 2329 Excl. Excl. CPU Excl. Excl. Name 2330 Instructions Cycles IPC CPI 2331 Executed sec. % 2332 473750931 0.093 100.00 2.552 0.392 <Total> 2333 188860269 0.035 37.93 2.682 0.373 erand48_r 2334 73623396 0.026 27.59 1.438 0.696 init_data 2335 76824434 0.018 18.97 2.182 0.458 drand48 2336 134442832 0.013 13.79 5.250 0.190 __drand48_iterate 2337 2338 Although this thread hardly uses any CPU cycles, the overall IPC of 23392.55 is not all that bad. 2340 2341 Last, we show the tables for threads 2 and 3: 2342 2343 Exp Sel Total 2344 === === ===== 2345 1 2 3 2346 Functions sorted by metric: Exclusive CPU Cycles 2347 2348 Excl. Excl. CPU Excl. Excl. Name 2349 Instructions Cycles IPC CPI 2350 Executed sec. % 2351 3716362189 1.298 100.00 1.435 0.697 <Total> 2352 3716362189 1.298 100.00 1.435 0.697 mxv_core 2353 0 0. 0. 0. 0. collector_root 2354 0 0. 0. 0. 0. driver_mxv 2355 2356 Exp Sel Total 2357 === === ===== 2358 1 3 3 2359 Functions sorted by metric: Exclusive CPU Cycles 2360 2361 Excl. Excl. CPU Excl. Excl. Name 2362 Instructions Cycles IPC CPI 2363 Executed sec. % 2364 3716362189 1.300 100.00 1.433 0.698 <Total> 2365 3716362189 1.300 100.00 1.433 0.698 mxv_core 2366 0 0. 0. 0. 0. collector_root 2367 0 0. 0. 0. 0. driver_mxv 2368 2369 It is seen that both execute the same number of instructions and take 2370about the same number of CPU cycles. As a result, the IPC is the same 2371for both threads. 2372 2373 2374File: gprofng.info, Node: Java Profiling, Prev: Profile Hardware Event Counters, Up: A Mini Tutorial 2375 23763.5 Java Profiling 2377================== 2378 2379The 'gprofng collect app' command supports Java profiling. The '-j on' 2380option can be used for this, but since this feature is enabled by 2381default, there is no need to set this explicitly. 
Java profiling may be 2382disabled through the '-j off' option. 2383 2384 The program is compiled as usual and the experiment directory is 2385created similar to what we have seen before. The only difference with a 2386C/C++ application is that the program has to be explicitly executed by 2387java. 2388 2389 For example, this is how to generate the experiment data for a Java 2390program that has the source code stored in file 'Pi.java': 2391 2392 $ javac Pi.java 2393 $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in 2394 2395 Regarding which java is selected to generate the data, 'gprofng' 2396first looks for the JDK in the path set in either the 'JDK_HOME' 2397environment variable, or in the 'JAVA_PATH' environment variable. If 2398neither of these variables is set, it checks for a JDK in the search 2399path (set in the PATH environment variable). If there is no JDK in this 2400path, it checks for the java executable in '/usr/java/bin/java'. 2401 2402 In case additional options need to be passed on to the JVM, the '-J 2403<string>' option can be used. The string with the option(s) has to be 2404delimited by quotation marks in case there is more than one argument. 2405 2406 The 'gprofng display text' command may be used to view the 2407performance data. There is no need for any special options and the same 2408commands as previously discussed are supported. 2409 2410 The 'viewmode' command *Note The Viewmode:: is very useful to examine 2411the call stacks. 2412 2413 For example, this is how one can see the native call stacks. For 2414lay-out purposes we have restricted the list to the first five entries: 2415 2416 $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er 2417 2418 Print limit set to 5 2419 Viewmode set to machine 2420 Functions Call Tree. Metric: Attributed Total CPU Time 2421 2422 Attr. Name 2423 Total 2424 CPU sec. 
     1.381        +-<Total>
     1.171          +-Pi.calculatePi(double)
     0.110          +-collector_root
     0.110          |  +-JavaMain
     0.070          |    +-jni_CallStaticVoidMethod

Note that the selection of the viewmode is echoed in the output.


File: gprofng.info, Node: Terminology, Next: Other Document Formats, Prev: A Mini Tutorial, Up: Top

4 Terminology
*************

Throughout this manual, certain terminology specific to profiling
tools, or 'gprofng', or even to this document only, is used.  In this
chapter we explain this terminology in detail.

* Menu:

* The Program Counter::                  What is a Program Counter?
* Inclusive and Exclusive Metrics::      An explanation of inclusive and exclusive metrics.
* Metric Definitions::                   Definitions associated with metrics.
* The Viewmode::                         Select the way call stacks are presented.
* The Selection List::                   How to define a selection.
* Load Objects and Functions::           The components in an application.
* The Concept of a CPU in gprofng::      The definition of a CPU.
* Hardware Event Counters Explained::    What are event counters?
* apath::                                Our generic definition of a path.


File: gprofng.info, Node: The Program Counter, Next: Inclusive and Exclusive Metrics, Up: Terminology

4.1 The Program Counter
=======================

The _Program Counter_, or PC for short, keeps track of where program
execution is.  The address of the next instruction to be executed is
stored in a special purpose register in the processor, or core.

   The PC is sometimes also referred to as the _instruction pointer_,
but we will use Program Counter or PC throughout this document.


File: gprofng.info, Node: Inclusive and Exclusive Metrics, Next: Metric Definitions, Prev: The Program Counter, Up: Terminology

4.2 Inclusive and Exclusive Metrics
===================================

In the remainder, these two concepts occur quite often and for lack of
a better place, they are explained here.

   The _inclusive_ value for a metric includes all values that are part
of the dynamic extent of the target function.  For example, if function
'A' calls functions 'B' and 'C', the inclusive CPU time for 'A'
includes the CPU time spent in 'B' and 'C'.

   In contrast with this, the _exclusive_ value for a metric is
computed by excluding the metric values used by other functions called.
In our imaginary example, the exclusive CPU time for function 'A' is
the time spent outside calling functions 'B' and 'C'.

   In case of a _leaf function_, the inclusive and exclusive values for
the metric are the same since by definition, it is not calling any
other function(s).

   Why do we use these two different values?  The inclusive metric
shows the most expensive path, in terms of this metric, in the
application.  For example, if the metric is cache misses, the function
with the highest inclusive metric tells you where most of the cache
misses come from.

   Within this branch of the application, the exclusive metric points
to the functions that contribute most, and helps to identify which
part(s) to consider for further analysis.


File: gprofng.info, Node: Metric Definitions, Next: The Viewmode, Prev: Inclusive and Exclusive Metrics, Up: Terminology

4.3 Metric Definitions
======================

The metrics to be shown are highly customizable.  In this section we
explain the definitions associated with metrics.

   The 'metrics' command takes a colon (:) separated list with special
keywords.
Each keyword consists of the following three fields:
'<flavor>''<visibility>''<metric_name>'.

   The _<flavor>_ field is either an 'e' for "exclusive", or an 'i'
for "inclusive".  The '<metric_name>' field is the name of the metric
request.  The _<visibility>_ field consists of one or more characters
from the following table:

'.'
     Show the metric as time.  This applies to timing metrics and
     hardware event counters that measure cycles.  Interpret as '+'
     for other metrics.

'%'
     Show the metric as a percentage of the total value for this
     metric.

'+'
     Show the metric as an absolute value.  For hardware event
     counters this is the event count.  Interpret as '.' for timing
     metrics.

'|'
     Do not show any metric value.  Cannot be used with other
     visibility characters.


File: gprofng.info,  Node: The Viewmode,  Next: The Selection List,  Prev: Metric Definitions,  Up: Terminology

4.4 The Viewmode
================

There are different ways to view a call stack in Java.  In 'gprofng',
this is called the _viewmode_ and the setting is controlled through a
command with the same name.

   The 'viewmode' command takes one of the following keywords:

'user'
     This is the default and shows the Java call stacks for Java
     threads.  No call stacks for any housekeeping threads are shown.
     The function list contains a function '<JVM-System>' that
     represents the aggregated time from non-Java threads.  When the
     JVM software does not report a Java call stack, time is reported
     against the function '<no Java callstack recorded>'.

'expert'
     Show the Java call stacks for Java threads when the Java code
     from the user is executed and machine call stacks when JVM code
     is executed, or when the JVM software does not report a Java call
     stack.  Show the machine call stacks for housekeeping threads.

'machine'
     Show the actual native call stacks for all threads.


File: gprofng.info,  Node: The Selection List,  Next: Load Objects and Functions,  Prev: The Viewmode,  Up: Terminology

4.5 The Selection List
======================

Several commands allow the user to specify a subset of a list, for
example, to select specific threads from all the threads that have
been used when conducting the experiment(s).

   Such a selection list (or "list" in the remainder of this section)
can be a single number, a contiguous range of numbers with the start
and end numbers separated by a hyphen ('-'), a comma-separated list of
numbers and ranges, or the 'all' keyword.  Lists must not contain
spaces.

   Each list can optionally be preceded by an experiment list with a
similar format, separated from the list by a colon (:).  If no
experiment list is included, the list applies to all experiments.

   Multiple lists can be concatenated by separating the individual
lists by a plus sign.

   These are some examples of various filters using a list:

'thread_select 1'
     Select thread 1 from all experiments.

'thread_select all:1'
     Select thread 1 from all experiments.

'thread_select 1:1+2:2'
     Select thread 1 from experiment 1 and thread 2 from experiment 2.

'cpu_select all:1,3,5'
     Select cores 1, 3, and 5 from all experiments.

'cpu_select 1,2:all'
     Select all cores from experiments 1 and 2, as listed by the
     'exp_list' command.


File: gprofng.info,  Node: Load Objects and Functions,  Next: The Concept of a CPU in gprofng,  Prev: The Selection List,  Up: Terminology

4.6 Load Objects and Functions
==============================

An application consists of various components.  The source code files
are compiled into object files.
These are then glued together at link time to form the executable.
During execution, the program may also dynamically load objects.

   A _load object_ is defined to be an executable or a shared object.
A shared library is an example of a load object in 'gprofng'.

   Each load object contains a text section with the instructions
generated by the compiler, a data section for data, and various symbol
tables.  All load objects must contain an ELF symbol table, which
gives the names and addresses of all the globally known functions in
that object.

   Load objects compiled with the -g option contain additional
symbolic information that can augment the ELF symbol table and provide
information about functions that are not global, additional
information about object modules from which the functions came, and
line number information relating addresses to source lines.

   The term _function_ is used to describe a set of instructions that
represent a high-level operation described in the source code.  The
term also covers methods as used in C++ and in the Java programming
language.

   In the 'gprofng' context, functions are provided in source code
format.  Normally their names appear in the symbol table representing
a set of addresses.  If the Program Counter (PC) is within that set,
the program is executing within that function.

   In principle, any address within the text segment of a load object
can be mapped to a function.  Exactly the same mapping is used for the
leaf PC and all the other PCs on the call stack.

   Most of the functions correspond directly to the source model of
the program, but there are exceptions.  This topic is, however,
outside the scope of this guide.
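   The mapping from a PC to a function can be sketched as a range
lookup over the function addresses recorded in the symbol table.  The
Python sketch below makes simplifying assumptions: the function names,
addresses, and sizes are made up, and the address ranges are taken to
be sorted and non-overlapping, which a real ELF symbol table does not
guarantee.

```python
import bisect

# Hypothetical symbol table entries: (start_address, size, name).
# Real entries come from the ELF symbol table of the load object.
symbols = [
    (0x1000, 0x40, "main"),
    (0x1040, 0x80, "mxv_core"),
    (0x10c0, 0x20, "cleanup"),
]
starts = [start for start, size, name in symbols]

def pc_to_function(pc):
    # Find the last function starting at or before this PC and check
    # that the PC falls within its address range.
    i = bisect.bisect_right(starts, pc) - 1
    if i >= 0:
        start, size, name = symbols[i]
        if start <= pc < start + size:
            return name
    return "<unknown>"
```

Exactly the same lookup would apply to the leaf PC and to every other
PC on the call stack.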


File: gprofng.info,  Node: The Concept of a CPU in gprofng,  Next: Hardware Event Counters Explained,  Prev: Load Objects and Functions,  Up: Terminology

4.7 The Concept of a CPU in gprofng
===================================

In gprofng, there is the concept of a CPU.  Admittedly, this is not
the best word to describe what is meant here and it may be replaced in
the future.

   The word CPU is used in many of the displays.  In the context of
gprofng, it is meant to denote a part of the processor that is capable
of executing instructions and has its own state, like the program
counter.

   For example, on a contemporary processor, a CPU could be a core.
In case hardware threads are supported within a core, it could be one
of those hardware threads.


File: gprofng.info,  Node: Hardware Event Counters Explained,  Next: apath,  Prev: The Concept of a CPU in gprofng,  Up: Terminology

4.8 Hardware Event Counters Explained
=====================================

For quite a number of years now, many microprocessors have supported
hardware event counters.

   On the hardware side, this means that in the processor there are
one or more registers dedicated to counting certain activities, or
"events".  Examples of such events are the number of instructions
executed, or the number of cache misses at level 2 in the memory
hierarchy.

   While there is a limited set of such registers, the user can map
events onto them.  In case more than one register is available, this
allows for the simultaneous measurement of various events.

   A simple, yet powerful, example is to simultaneously count the
number of CPU cycles and the number of instructions executed.  These
two numbers can then be used to compute the _IPC_ value.  IPC stands
for "Instructions Per Clockcycle" and each processor has a maximum.
For example, if this maximum number is 2, it means the processor is
capable of executing two instructions every clock cycle.

   Whether this is actually achieved depends on several factors,
including the instruction characteristics.  However, in case the IPC
value is well below this maximum in a time critical part of the
application and this cannot be easily explained, further investigation
is probably warranted.

   A related metric is called _CPI_, or "Clockcycles Per Instruction".
It is the inverse of the IPC and can be compared against the
theoretical value(s) of the target instruction(s).  A significant
difference may point at a bottleneck.

   One thing to keep in mind is that the value returned by a counter
can either be the number of times the event occurred, or a CPU cycle
count.  In case of the latter, it is possible to convert this number
to time.

   This is often easier to interpret than a simple count, but there is
one caveat to keep in mind.  The CPU frequency may not have been
constant while the experiment was recorded and this impacts the time
reported.

   These event counters, or "counters" for short, provide great
insight into what happens deep inside the processor.  In case higher
level information does not provide the insight needed, the counters
provide the information to get to the bottom of a performance problem.

   There are some things to consider though.

   * The event definitions and names vary across processors and it may
     even happen that some events change with an update.
     Unfortunately, and luckily this is rare, there are sometimes bugs
     that cause the wrong count to be returned.

     In 'gprofng', some of the processor specific event names have an
     alias name.  For example, 'insts' measures the instructions
     executed.
These aliases not only make it easier to identify the functionality,
     but also provide portability of certain events across processors.

   * Another complexity is that there are typically many events one
     can monitor.  There may be up to hundreds of events available and
     it could require several experiments to zoom in on the root cause
     of a performance problem.

   * There may be restrictions regarding the mapping of event(s) onto
     the counters.  For example, certain events may be restricted to
     specific counters only.  As a result, one may have to conduct
     additional experiments to cover all the events of interest.

   * The names of the events may also not be easy to interpret.  In
     such cases, the description can be found in the architecture
     manual for the processor.

   Despite these drawbacks, hardware event counters are extremely
useful and may even turn out to be indispensable.


File: gprofng.info,  Node: apath,  Prev: Hardware Event Counters Explained,  Up: Terminology

4.9 What is <apath>?
====================

In most cases, 'gprofng' shows the absolute pathnames of directories.
These tend to be rather long, causing display issues in this document.

   Instead of wrapping these long pathnames over multiple lines, we
decided to represent them by the '<apath>' symbol, which stands for
"an absolute pathname".

   Note that different occurrences of '<apath>' may represent
different absolute pathnames.


File: gprofng.info,  Node: Other Document Formats,  Next: Index,  Prev: Terminology,  Up: Top

5 Other Document Formats
************************

This document is written in Texinfo and the source text is made
available as part of the binutils distribution.  The file name is
'gprofng.texi' and can be found in subdirectory 'doc' under directory
'gprofng' in the top level directory.

   This file can be used to generate the document in the 'info',
'html', and 'pdf' formats.  The default installation procedure creates
a file in the 'info' format and stores it in the documentation section
of binutils.

   Probably the easiest way to generate a different format from this
Texinfo document is to go to the distribution directory that was
created when the tools were built.  This is either the default
distribution directory, or the one that has been set with the
'--prefix' option as part of the 'configure' command.  In this example
we symbolize this location with '<dist>'.

   The make file called 'Makefile' in directory '<dist>/gprofng/doc'
supports several commands to generate this document in different
formats.  We recommend using these commands.

   They create the file(s) and install them in the documentation
directory of binutils, which is '<dist>/share/doc' in case 'html' or
'pdf' is selected and '<dist>/share/info' for the file in the 'info'
format.

   To generate this document in the requested format and install it in
the documentation directory, the commands below should be executed.
In this notation, '<format>' is one of 'info', 'html', or 'pdf':

     $ cd <dist>/gprofng/doc
     $ make install-<format>

Some things to note:

   * For the 'pdf' file to be generated, the TeX document formatting
     software is required and the relevant commands need to be
     included in the search path.  An example of a popular TeX
     implementation is _TexLive_.  It is beyond the scope of this
     document to go into the details of installing and using TeX, but
     it is well documented elsewhere.

   * Instead of generating a single file in the 'html' format, it is
     also possible to create a directory with individual files for the
     various chapters.
To do so, remove the use of '--no-split' in variable 'MAKEINFOHTML'
     in the make file in the 'doc' directory.

   * The make file also supports commands to only generate the file in
     the desired format and not move it to the documentation
     directory.  This is accomplished through the 'make <format>'
     command.


File: gprofng.info,  Node: Index,  Prev: Other Document Formats,  Up: Top

Index
*****

[index]
* Menu:

* Command line mode:                     A First Profile.    (line  39)
* Commands, calltree:                    The Call Tree.      (line  10)
* Commands, compare delta:               Comparison of Experiments.
                                                             (line  52)
* Commands, compare on/off:              Comparison of Experiments.
                                                             (line   7)
* Commands, compare on/off <1>:          Comparison of Experiments.
                                                             (line  51)
* Commands, compare ratio:               Comparison of Experiments.
                                                             (line  52)
* Commands, compare ratio <1>:           Examples Using Hardware Event Counters.
                                                             (line 166)
* Commands, cpus:                        Commands Specific to Multithreading.
                                                             (line 177)
* Commands, cpu_list:                    Commands Specific to Multithreading.
                                                             (line 176)
* Commands, disasm:                      The Disassembly View.
                                                             (line   8)
* Commands, experiment_list:             Aggregation of Experiments.
                                                             (line  21)
* Commands, fsingle:                     Information on Load Objects.
                                                             (line  10)
* Commands, fsingle <1>:                 Information on Load Objects.
                                                             (line  36)
* Commands, fsummary:                    Information on Load Objects.
                                                             (line  10)
* Commands, fsummary <1>:                Information on Load Objects.
                                                             (line  62)
* Commands, functions:                   A First Profile.    (line  44)
* Commands, header:                      More Information on the Experiment.
                                                             (line  10)
* Commands, header <1>:                  Control the Sampling Frequency.
                                                             (line  59)
* Commands, limit:                       Control the Number of Lines in the Output.
                                                             (line   6)
* Commands, lines:                       The Source Code View.
                                                             (line  68)
* Commands, metrics:                     Display and Define the Metrics.
2867 (line 11) 2868* Commands, metrics <1>: Display and Define the Metrics. 2869 (line 38) 2870* Commands, metrics <2>: Metric Definitions. (line 9) 2871* Commands, metric_list: Display and Define the Metrics. 2872 (line 10) 2873* Commands, metric_list <1>: Display and Define the Metrics. 2874 (line 19) 2875* Commands, metric_list <2>: Examples Using Hardware Event Counters. 2876 (line 261) 2877* Commands, objects: Information on Load Objects. 2878 (line 10) 2879* Commands, overview: More Information on the Experiment. 2880 (line 54) 2881* Commands, pcs: The Disassembly View. 2882 (line 73) 2883* Commands, script: Scripting. (line 11) 2884* Commands, sort: Sorting the Performance Data. 2885 (line 6) 2886* Commands, source: The Source Code View. 2887 (line 12) 2888* Commands, threads: Commands Specific to Multithreading. 2889 (line 35) 2890* Commands, thread_list: Commands Specific to Multithreading. 2891 (line 10) 2892* Commands, thread_select: Commands Specific to Multithreading. 2893 (line 64) 2894* Commands, thread_select <1>: Commands Specific to Multithreading. 2895 (line 88) 2896* Commands, viewmode: Java Profiling. (line 37) 2897* Commands, viewmode <1>: The Viewmode. (line 6) 2898* Compare experiments: Comparison of Experiments. 2899 (line 10) 2900* CPI: Hardware Event Counters Explained. 2901 (line 31) 2902* CPU: The Concept of a CPU in gprofng. 2903 (line 6) 2904* Default metrics: Display and Define the Metrics. 2905 (line 34) 2906* ELF: Load Objects and Functions. 2907 (line 16) 2908* Exclusive metric: Inclusive and Exclusive Metrics. 2909 (line 14) 2910* Experiment directory: Steps Needed to Create a Profile. 2911 (line 21) 2912* Filters, Thread selection: Commands Specific to Multithreading. 2913 (line 64) 2914* Flavor field: Metric Definitions. (line 13) 2915* Function: Load Objects and Functions. 2916 (line 26) 2917* gprofng display html: Steps Needed to Create a Profile. 2918 (line 46) 2919* gprofng display text: Steps Needed to Create a Profile. 
2920 (line 42) 2921* Hardware event counters, alias name: Hardware Event Counters Explained. 2922 (line 57) 2923* Hardware event counters, auto option: Examples Using Hardware Event Counters. 2924 (line 21) 2925* Hardware event counters, counter definition: Getting Information on the Counters Supported. 2926 (line 85) 2927* Hardware event counters, description: Hardware Event Counters Explained. 2928 (line 6) 2929* Hardware event counters, hwc metric: Examples Using Hardware Event Counters. 2930 (line 31) 2931* Hardware event counters, IPC: Examples Using Hardware Event Counters. 2932 (line 250) 2933* Hardware event counters, variable CPU frequency: Hardware Event Counters Explained. 2934 (line 40) 2935* Inclusive metric: Inclusive and Exclusive Metrics. 2936 (line 9) 2937* Instruction level timings: The Disassembly View. 2938 (line 9) 2939* Instruction pointer: The Program Counter. (line 10) 2940* Interpreter mode: A First Profile. (line 30) 2941* IPC: Hardware Event Counters Explained. 2942 (line 20) 2943* Java profiling, -J <string>: Java Profiling. (line 29) 2944* Java profiling, -j on/off: Java Profiling. (line 6) 2945* Java profiling, <JVM-System>: The Viewmode. (line 15) 2946* Java profiling, <no Java callstack recorded>: The Viewmode. (line 18) 2947* Java profiling, different view modes: Java Profiling. (line 37) 2948* Java profiling, JAVA_PATH: Java Profiling. (line 24) 2949* Java profiling, JDK_HOME: Java Profiling. (line 23) 2950* Leaf function: Inclusive and Exclusive Metrics. 2951 (line 19) 2952* List specification: The Selection List. (line 6) 2953* Load object: Load Objects and Functions. 2954 (line 11) 2955* Load objects: Information on Load Objects. 2956 (line 11) 2957* Metric name field: Metric Definitions. (line 13) 2958* Miscellaneous , ##: The Source Code View. 2959 (line 63) 2960* Miscellaneous, <apath>: Information on Load Objects. 2961 (line 16) 2962* Miscellaneous, <Total>: A First Profile. 
(line 86) 2963* mxv-pthreads.exe: The Example Program. (line 12) 2964* Options, -C: More Information on the Experiment. 2965 (line 48) 2966* Options, -h: Getting Information on the Counters Supported. 2967 (line 9) 2968* Options, -h <1>: Examples Using Hardware Event Counters. 2969 (line 21) 2970* Options, -o: Name the Experiment Directory. 2971 (line 15) 2972* Options, -O: Name the Experiment Directory. 2973 (line 19) 2974* Options, -O <1>: A More Elaborate Example. 2975 (line 12) 2976* Options, -p: The Call Tree. (line 66) 2977* Options, -p <1>: Control the Sampling Frequency. 2978 (line 18) 2979* PC: The Program Counter. (line 6) 2980* PC <1>: Load Objects and Functions. 2981 (line 32) 2982* PC sampling: Sampling versus Tracing. 2983 (line 7) 2984* Posix Threads: The Example Program. (line 8) 2985* Program Counter: The Program Counter. (line 6) 2986* Program Counter <1>: Load Objects and Functions. 2987 (line 32) 2988* Program Counter sampling: Sampling versus Tracing. 2989 (line 7) 2990* Pthreads: The Example Program. (line 8) 2991* Sampling interval: Control the Sampling Frequency. 2992 (line 20) 2993* Selection list: The Selection List. (line 6) 2994* Source level timings: The Source Code View. 2995 (line 10) 2996* TeX: Other Document Formats. 2997 (line 40) 2998* Thread affinity: Commands Specific to Multithreading. 2999 (line 162) 3000* Total CPU time: A First Profile. (line 74) 3001* Viewmode: The Viewmode. (line 6) 3002* Visibility field: Sorting the Performance Data. 3003 (line 9) 3004* Visibility field <1>: Metric Definitions. 
(line 13) 3005 3006 3007 3008Tag Table: 3009Node: Top750 3010Node: Introduction3075 3011Node: Overview4352 3012Node: Main Features4993 3013Node: Sampling versus Tracing6733 3014Node: Steps Needed to Create a Profile9124 3015Node: A Mini Tutorial11311 3016Node: Getting Started12030 3017Node: The Example Program13680 3018Node: A First Profile14800 3019Node: The Source Code View19305 3020Node: The Disassembly View23801 3021Node: Display and Define the Metrics28490 3022Node: A First Customization of the Output30326 3023Node: Name the Experiment Directory32175 3024Node: Control the Number of Lines in the Output33272 3025Node: Sorting the Performance Data34031 3026Node: Scripting34973 3027Node: A More Elaborate Example35892 3028Node: The Call Tree38682 3029Node: More Information on the Experiment41423 3030Node: Control the Sampling Frequency44722 3031Node: Information on Load Objects47143 3032Node: Support for Multithreading50782 3033Node: Creating a Multithreading Experiment51422 3034Node: Commands Specific to Multithreading52935 3035Node: Viewing Multiple Experiments62341 3036Node: Aggregation of Experiments62918 3037Node: Comparison of Experiments65232 3038Node: Profile Hardware Event Counters69848 3039Node: Getting Information on the Counters Supported70556 3040Node: Examples Using Hardware Event Counters78156 3041Node: Java Profiling95853 3042Node: Terminology98222 3043Node: The Program Counter99234 3044Node: Inclusive and Exclusive Metrics99726 3045Node: Metric Definitions101179 3046Node: The Viewmode102364 3047Node: The Selection List103503 3048Node: Load Objects and Functions104905 3049Node: The Concept of a CPU in gprofng106919 3050Node: Hardware Event Counters Explained107680 3051Node: apath111472 3052Node: Other Document Formats112007 3053Node: Index114531 3054 3055End Tag Table 3056 3057 3058Local Variables: 3059coding: utf-8 3060End: 3061