# User Guide

## Command Line

[Output Formats](#output-formats)

[Output Files](#output-files)

[Running Benchmarks](#running-benchmarks)

[Running a Subset of Benchmarks](#running-a-subset-of-benchmarks)

[Result Comparison](#result-comparison)

[Extra Context](#extra-context)

## Library

[Runtime and Reporting Considerations](#runtime-and-reporting-considerations)

[Setup/Teardown](#setupteardown)

[Passing Arguments](#passing-arguments)

[Custom Benchmark Name](#custom-benchmark-name)

[Calculating Asymptotic Complexity](#asymptotic-complexity)

[Templated Benchmarks](#templated-benchmarks)

[Templated Benchmarks that take arguments](#templated-benchmarks-with-arguments)

[Fixtures](#fixtures)

[Custom Counters](#custom-counters)

[Multithreaded Benchmarks](#multithreaded-benchmarks)

[CPU Timers](#cpu-timers)

[Manual Timing](#manual-timing)

[Setting the Time Unit](#setting-the-time-unit)

[Random Interleaving](random_interleaving.md)

[User-Requested Performance Counters](perf_counters.md)

[Preventing Optimization](#preventing-optimization)

[Reporting Statistics](#reporting-statistics)

[Custom Statistics](#custom-statistics)

[Memory Usage](#memory-usage)

[Using RegisterBenchmark](#using-register-benchmark)

[Exiting with an Error](#exiting-with-an-error)

[A Faster `KeepRunning` Loop](#a-faster-keep-running-loop)

## Benchmarking Tips

[Disabling CPU Frequency Scaling](#disabling-cpu-frequency-scaling)

[Reducing Variance in Benchmarks](reducing_variance.md)

<a name="output-formats" />

## Output Formats

The library supports multiple output formats. Use the
`--benchmark_format=<console|json|csv>` flag (or set the
`BENCHMARK_FORMAT=<console|json|csv>` environment variable) to set
the format type. `console` is the default format.
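
For example, either of the following invocations selects JSON output (the binary name `run_benchmarks.x` is illustrative):

```bash
$ ./run_benchmarks.x --benchmark_format=json
$ BENCHMARK_FORMAT=json ./run_benchmarks.x
```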

The console format is intended to be human readable. By default
the format generates color output. Context is output on stderr and the
tabular data on stdout. Example tabular output looks like:

```
Benchmark                               Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------------
BM_SetInsert/1024/1                        28928      29349      23853  133.097kB/s   33.2742k items/s
BM_SetInsert/1024/8                        32065      32913      21375  949.487kB/s   237.372k items/s
BM_SetInsert/1024/10                       33157      33648      21431  1.13369MB/s   290.225k items/s
```

The JSON format outputs human-readable JSON split into two top level attributes.
The `context` attribute contains information about the run in general, including
information about the CPU and the date.
The `benchmarks` attribute contains a list of every benchmark run. Example JSON
output looks like:

```json
{
  "context": {
    "date": "2015/03/17-18:40:25",
    "num_cpus": 40,
    "mhz_per_cpu": 2801,
    "cpu_scaling_enabled": false,
    "build_type": "debug"
  },
  "benchmarks": [
    {
      "name": "BM_SetInsert/1024/1",
      "iterations": 94877,
      "real_time": 29275,
      "cpu_time": 29836,
      "bytes_per_second": 134066,
      "items_per_second": 33516
    },
    {
      "name": "BM_SetInsert/1024/8",
      "iterations": 21609,
      "real_time": 32317,
      "cpu_time": 32429,
      "bytes_per_second": 986770,
      "items_per_second": 246693
    },
    {
      "name": "BM_SetInsert/1024/10",
      "iterations": 21393,
      "real_time": 32724,
      "cpu_time": 33355,
      "bytes_per_second": 1199226,
      "items_per_second": 299807
    }
  ]
}
```

The CSV format outputs comma-separated values. The `context` is output on stderr
and the CSV itself on stdout. Example CSV output looks like:

```
name,iterations,real_time,cpu_time,bytes_per_second,items_per_second,label
"BM_SetInsert/1024/1",65465,17890.7,8407.45,475768,118942,
"BM_SetInsert/1024/8",116606,18810.1,9766.64,3.27646e+06,819115,
"BM_SetInsert/1024/10",106365,17238.4,8421.53,4.74973e+06,1.18743e+06,
```

<a name="output-files" />

## Output Files

Write benchmark results to a file with the `--benchmark_out=<filename>` option
(or set `BENCHMARK_OUT`). Specify the output format with
`--benchmark_out_format={json|console|csv}` (or set
`BENCHMARK_OUT_FORMAT={json|console|csv}`). Note that the 'csv' reporter is
deprecated and the saved `.csv` file
[is not parsable](https://github.com/google/benchmark/issues/794) by csv
parsers.

Specifying `--benchmark_out` does not suppress the console output.

<a name="running-benchmarks" />

## Running Benchmarks

Benchmarks are executed by running the produced binaries. Benchmark binaries,
by default, accept options that may be specified either through their command
line interface or by setting environment variables before execution. For every
`--option_flag=<value>` CLI switch, a corresponding environment variable
`OPTION_FLAG=<value>` exists and is used as the default if set (CLI switches
always prevail). A complete list of CLI options is available by running
benchmarks with the `--help` switch.

<a name="running-a-subset-of-benchmarks" />

## Running a Subset of Benchmarks

The `--benchmark_filter=<regex>` option (or `BENCHMARK_FILTER=<regex>`
environment variable) can be used to only run the benchmarks that match
the specified `<regex>`. For example:

```bash
$ ./run_benchmarks.x --benchmark_filter=BM_memcpy/32
Run on (1 X 2300 MHz CPU )
2016-06-25 19:34:24
Benchmark              Time           CPU Iterations
----------------------------------------------------
BM_memcpy/32          11 ns         11 ns   79545455
BM_memcpy/32k       2181 ns       2185 ns     324074
BM_memcpy/32          12 ns         12 ns   54687500
BM_memcpy/32k       1834 ns       1837 ns     357143
```

## Disabling Benchmarks

It is possible to temporarily disable benchmarks by renaming the benchmark
function to have the prefix "DISABLED_". This will cause the benchmark to
be skipped at runtime.
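
For example, a minimal sketch (`DISABLED_BM_SlowOperation` is a hypothetical benchmark):

```c++
// The "DISABLED_" prefix on the function (and thus the registered name)
// causes this benchmark to be skipped at runtime; remove the prefix to
// re-enable it.
static void DISABLED_BM_SlowOperation(benchmark::State& state) {
  for (auto _ : state) {
    // ... code under benchmark ...
  }
}
BENCHMARK(DISABLED_BM_SlowOperation);
```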

<a name="result-comparison" />

## Result Comparison

It is possible to compare the benchmarking results.
See [Additional Tooling Documentation](tools.md).

<a name="extra-context" />

## Extra Context

Sometimes it's useful to add extra context to the content printed before the
results. By default this section includes information about the CPU on which
the benchmarks are running. If you do want to add more context, you can use
the `benchmark_context` command line flag:

```bash
$ ./run_benchmarks --benchmark_context=pwd=`pwd`
Run on (1 x 2300 MHz CPU)
pwd: /home/user/benchmark/
Benchmark              Time           CPU Iterations
----------------------------------------------------
BM_memcpy/32          11 ns         11 ns   79545455
BM_memcpy/32k       2181 ns       2185 ns     324074
```

You can get the same effect with the API:

```c++
  benchmark::AddCustomContext("foo", "bar");
```

Note that attempts to add a second value with the same key will fail with an
error message.

<a name="runtime-and-reporting-considerations" />

## Runtime and Reporting Considerations

When the benchmark binary is executed, each benchmark function is run serially.
The number of iterations to run is determined dynamically by running the
benchmark a few times, measuring the time taken, and ensuring that the
ultimate result will be statistically stable. As such, faster benchmark
functions will be run for more iterations than slower benchmark functions, and
the number of iterations is reported alongside the results.

In all cases, the number of iterations for which the benchmark is run is
governed by the amount of time the benchmark takes. Concretely, the number of
iterations is at least one and not more than 1e9; within those bounds,
iterations are added until the CPU time exceeds the minimum time, or the
wallclock time reaches 5x the minimum time. The minimum time is set per
benchmark by calling `MinTime` on the registered benchmark object.

Furthermore, warming up a benchmark might be necessary in order to get
stable results, e.g. because of caching effects in the code under benchmark.
Warming up means running the benchmark for a given amount of time before
results are actually taken into account. The amount of time for which
the warmup should be run can be set per benchmark by calling
`MinWarmUpTime` on the registered benchmark object, or for all benchmarks
using the `--benchmark_min_warmup_time` command-line option. Note that
`MinWarmUpTime` overrides the value of `--benchmark_min_warmup_time`
for that single benchmark. How many iterations the warmup run of each
benchmark takes is determined the same way as described in the paragraph
above. By default the warmup time is 0 seconds, so the warmup phase is
disabled.

Average timings are then reported over the iterations run. If multiple
repetitions are requested using the `--benchmark_repetitions` command-line
option, or at registration time, the benchmark function will be run several
times and statistical results across these repetitions will also be reported.

As well as the per-benchmark entries, a preamble in the report will include
information about the machine on which the benchmarks are run.
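
For example, the per-benchmark minimum time and warmup time described above can be combined at registration time; a minimal sketch, with a hypothetical `BM_Compute` benchmark:

```c++
// Discard a 1 second warmup phase, then keep adding iterations until the
// benchmark has accumulated at least 2 seconds of CPU time.
BENCHMARK(BM_Compute)->MinWarmUpTime(1.0)->MinTime(2.0);
```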

<a name="setup-teardown" />

## Setup/Teardown

Global setup/teardown specific to each benchmark can be done by
passing a callback to `Setup`/`Teardown`.

The setup/teardown callbacks will be invoked once for each benchmark. If the
benchmark is multi-threaded (will run in k threads), they will be invoked
exactly once before each run with k threads.

If the benchmark uses different size groups of threads, the above will be true
for each size group.

For example:

```c++
static void DoSetup(const benchmark::State& state) {
}

static void DoTeardown(const benchmark::State& state) {
}

static void BM_func(benchmark::State& state) {...}

BENCHMARK(BM_func)->Arg(1)->Arg(3)->Threads(16)->Threads(32)->Setup(DoSetup)->Teardown(DoTeardown);
```

In this example, `DoSetup` and `DoTeardown` will be invoked 4 times each,
specifically, once for each member of this family:
 - BM_func_Arg_1_Threads_16, BM_func_Arg_1_Threads_32
 - BM_func_Arg_3_Threads_16, BM_func_Arg_3_Threads_32

<a name="passing-arguments" />

## Passing Arguments

Sometimes a family of benchmarks can be implemented with just one routine that
takes an extra argument to specify which one of the family of benchmarks to
run. For example, the following code defines a family of benchmarks for
measuring the speed of `memcpy()` calls of different lengths:

```c++
static void BM_memcpy(benchmark::State& state) {
  char* src = new char[state.range(0)];
  char* dst = new char[state.range(0)];
  memset(src, 'x', state.range(0));
  for (auto _ : state)
    memcpy(dst, src, state.range(0));
  state.SetBytesProcessed(int64_t(state.iterations()) *
                          int64_t(state.range(0)));
  delete[] src;
  delete[] dst;
}
BENCHMARK(BM_memcpy)->Arg(8)->Arg(64)->Arg(512)->Arg(4<<10)->Arg(8<<10);
```

The preceding code is quite repetitive, and can be replaced with the following
short-hand. The following invocation will pick a few appropriate arguments in
the specified range and will generate a benchmark for each such argument.

```c++
BENCHMARK(BM_memcpy)->Range(8, 8<<10);
```

By default the arguments in the range are generated in multiples of eight and
the command above selects [ 8, 64, 512, 4k, 8k ]. In the following code the
range multiplier is changed to multiples of two.

```c++
BENCHMARK(BM_memcpy)->RangeMultiplier(2)->Range(8, 8<<10);
```

Now the arguments generated are [ 8, 16, 32, 64, 128, 256, 512, 1024, 2k, 4k, 8k ].

The preceding code shows a method of defining a sparse range. The following
example shows a method of defining a dense range. It is then used to benchmark
the performance of `std::vector` initialization for uniformly increasing sizes.

```c++
static void BM_DenseRange(benchmark::State& state) {
  for(auto _ : state) {
    std::vector<int> v(state.range(0), state.range(0));
    auto data = v.data();
    benchmark::DoNotOptimize(data);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_DenseRange)->DenseRange(0, 1024, 128);
```

Now the arguments generated are [ 0, 128, 256, 384, 512, 640, 768, 896, 1024 ].

You might have a benchmark that depends on two or more inputs. For example, the
following code defines a family of benchmarks for measuring the speed of set
insertion.

```c++
static void BM_SetInsert(benchmark::State& state) {
  std::set<int> data;
  for (auto _ : state) {
    state.PauseTiming();
    data = ConstructRandomSet(state.range(0));
    state.ResumeTiming();
    for (int j = 0; j < state.range(1); ++j)
      data.insert(RandomNumber());
  }
}
BENCHMARK(BM_SetInsert)
    ->Args({1<<10, 128})
    ->Args({2<<10, 128})
    ->Args({4<<10, 128})
    ->Args({8<<10, 128})
    ->Args({1<<10, 512})
    ->Args({2<<10, 512})
    ->Args({4<<10, 512})
    ->Args({8<<10, 512});
```

The preceding code is quite repetitive, and can be replaced with the following
short-hand. The following macro will pick a few appropriate arguments in the
product of the two specified ranges and will generate a benchmark for each such
pair.

<!-- {% raw %} -->
```c++
BENCHMARK(BM_SetInsert)->Ranges({{1<<10, 8<<10}, {128, 512}});
```
<!-- {% endraw %} -->

Some benchmarks may require specific argument values that cannot be expressed
with `Ranges`. In this case, `ArgsProduct` offers the ability to generate a
benchmark input for each combination in the product of the supplied vectors.

<!-- {% raw %} -->
```c++
BENCHMARK(BM_SetInsert)
    ->ArgsProduct({{1<<10, 3<<10, 8<<10}, {20, 40, 60, 80}})
// would generate the same benchmark arguments as
BENCHMARK(BM_SetInsert)
    ->Args({1<<10, 20})
    ->Args({3<<10, 20})
    ->Args({8<<10, 20})
    ->Args({1<<10, 40})
    ->Args({3<<10, 40})
    ->Args({8<<10, 40})
    ->Args({1<<10, 60})
    ->Args({3<<10, 60})
    ->Args({8<<10, 60})
    ->Args({1<<10, 80})
    ->Args({3<<10, 80})
    ->Args({8<<10, 80});
```
<!-- {% endraw %} -->

For the most common scenarios, helper methods for creating a list of
integers for a given sparse or dense range are provided.

```c++
BENCHMARK(BM_SetInsert)
    ->ArgsProduct({
      benchmark::CreateRange(8, 128, /*multi=*/2),
      benchmark::CreateDenseRange(1, 4, /*step=*/1)
    })
// would generate the same benchmark arguments as
BENCHMARK(BM_SetInsert)
    ->ArgsProduct({
      {8, 16, 32, 64, 128},
      {1, 2, 3, 4}
    });
```

For more complex patterns of inputs, passing a custom function to `Apply` allows
programmatic specification of an arbitrary set of arguments on which to run the
benchmark. The following example enumerates a dense range on one parameter,
and a sparse range on the second.

```c++
static void CustomArguments(benchmark::internal::Benchmark* b) {
  for (int i = 0; i <= 10; ++i)
    for (int j = 32; j <= 1024*1024; j *= 8)
      b->Args({i, j});
}
BENCHMARK(BM_SetInsert)->Apply(CustomArguments);
```

### Passing Arbitrary Arguments to a Benchmark

In C++11 it is possible to define a benchmark that takes an arbitrary number
of extra arguments. The `BENCHMARK_CAPTURE(func, test_case_name, ...args)`
macro creates a benchmark that invokes `func` with the `benchmark::State` as
the first argument followed by the specified `args...`.
The `test_case_name` is appended to the name of the benchmark and
should describe the values passed.

```c++
template <class ...Args>
void BM_takes_args(benchmark::State& state, Args&&... args) {
  auto args_tuple = std::make_tuple(std::move(args)...);
  for (auto _ : state) {
    std::cout << std::get<0>(args_tuple) << ": " << std::get<1>(args_tuple)
              << '\n';
    [...]
  }
}
// Registers a benchmark named "BM_takes_args/int_string_test" that passes
// the specified values to `args`.
BENCHMARK_CAPTURE(BM_takes_args, int_string_test, 42, std::string("abc"));

// Registers a second benchmark, "BM_takes_args/int_test", that passes
// the specified values to `args`.
BENCHMARK_CAPTURE(BM_takes_args, int_test, 42, 43);
```

Note that elements of `...args` may refer to global variables. Users should
avoid modifying global state inside of a benchmark.

<a name="asymptotic-complexity" />

## Calculating Asymptotic Complexity (Big O)

Asymptotic complexity can be calculated for a family of benchmarks. The
following code will calculate the coefficient for the high-order term in the
running time and the normalized root-mean-square error of string comparison.

```c++
static void BM_StringCompare(benchmark::State& state) {
  std::string s1(state.range(0), '-');
  std::string s2(state.range(0), '-');
  for (auto _ : state) {
    auto comparison_result = s1.compare(s2);
    benchmark::DoNotOptimize(comparison_result);
  }
  state.SetComplexityN(state.range(0));
}
BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)->Range(1<<10, 1<<18)->Complexity(benchmark::oN);
```

As shown in the following invocation, asymptotic complexity can also be
inferred automatically.

```c++
BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)->Range(1<<10, 1<<18)->Complexity();
```

The following code specifies asymptotic complexity with a lambda function
that can be used to customize the high-order term calculation.

```c++
BENCHMARK(BM_StringCompare)->RangeMultiplier(2)
    ->Range(1<<10, 1<<18)->Complexity([](benchmark::IterationCount n)->double{return n; });
```

<a name="custom-benchmark-name" />

## Custom Benchmark Name

You can change the benchmark's name as follows:

```c++
BENCHMARK(BM_memcpy)->Name("memcpy")->RangeMultiplier(2)->Range(8, 8<<10);
```

The invocation will execute the benchmark as before using `BM_memcpy` but
changes the prefix in the report to `memcpy`.

<a name="templated-benchmarks" />

## Templated Benchmarks

This example produces and consumes messages of size `sizeof(v)`
`state.range(0)` times. It also outputs throughput in the absence of
multiprogramming.

```c++
template <class Q> void BM_Sequential(benchmark::State& state) {
  Q q;
  typename Q::value_type v;
  for (auto _ : state) {
    for (int i = state.range(0); i--; )
      q.push(v);
    for (int e = state.range(0); e--; )
      q.Wait(&v);
  }
  // actually messages, not bytes:
  state.SetBytesProcessed(
      static_cast<int64_t>(state.iterations())*state.range(0));
}
// C++03
BENCHMARK_TEMPLATE(BM_Sequential, WaitQueue<int>)->Range(1<<0, 1<<10);

// In C++11 or newer, you can use the BENCHMARK macro with template parameters:
BENCHMARK(BM_Sequential<WaitQueue<int>>)->Range(1<<0, 1<<10);
```

Three macros are provided for adding benchmark templates.

```c++
#ifdef BENCHMARK_HAS_CXX11
#define BENCHMARK(func<...>) // Takes any number of parameters.
#else // C++ < C++11
#define BENCHMARK_TEMPLATE(func, arg1)
#endif
#define BENCHMARK_TEMPLATE1(func, arg1)
#define BENCHMARK_TEMPLATE2(func, arg1, arg2)
```
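
For example, here is a minimal sketch of `BENCHMARK_TEMPLATE2` with a hypothetical benchmark templated on two parameters:

```c++
#include <benchmark/benchmark.h>
#include <queue>

// Hypothetical benchmark templated on both a container type and a value type.
template <class Container, class V>
void BM_PushPop(benchmark::State& state) {
  for (auto _ : state) {
    Container c;
    for (int i = state.range(0); i--; ) c.push(V());
    while (!c.empty()) c.pop();
  }
}
// BENCHMARK_TEMPLATE2 supplies both template arguments at registration.
BENCHMARK_TEMPLATE2(BM_PushPop, std::queue<int>, int)->Range(1<<0, 1<<10);
```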

<a name="templated-benchmarks-with-arguments" />

## Templated Benchmarks that take arguments

Sometimes there is a need to template benchmarks and provide arguments to them.

```c++
template <class Q> void BM_Sequential_With_Step(benchmark::State& state, int step) {
  Q q;
  typename Q::value_type v;
  for (auto _ : state) {
    for (int i = state.range(0); i > 0; i -= step)
      q.push(v);
    for (int e = state.range(0); e > 0; e -= step)
      q.Wait(&v);
  }
  // actually messages, not bytes:
  state.SetBytesProcessed(
      static_cast<int64_t>(state.iterations())*state.range(0));
}

BENCHMARK_TEMPLATE1_CAPTURE(BM_Sequential_With_Step, WaitQueue<int>, Step1, 1)->Range(1<<0, 1<<10);
```

<a name="fixtures" />

## Fixtures

Fixture tests are created by first defining a type that derives from
`::benchmark::Fixture` and then creating/registering the tests using the
following macros:

* `BENCHMARK_F(ClassName, Method)`
* `BENCHMARK_DEFINE_F(ClassName, Method)`
* `BENCHMARK_REGISTER_F(ClassName, Method)`

For example:

```c++
class MyFixture : public benchmark::Fixture {
public:
  void SetUp(::benchmark::State& state) {
  }

  void TearDown(::benchmark::State& state) {
  }
};

BENCHMARK_F(MyFixture, FooTest)(benchmark::State& st) {
  for (auto _ : st) {
    ...
  }
}

BENCHMARK_DEFINE_F(MyFixture, BarTest)(benchmark::State& st) {
  for (auto _ : st) {
    ...
  }
}
/* BarTest is NOT registered */
BENCHMARK_REGISTER_F(MyFixture, BarTest)->Threads(2);
/* BarTest is now registered */
```

### Templated Fixtures

You can also create templated fixtures by using the following macros:

* `BENCHMARK_TEMPLATE_F(ClassName, Method, ...)`
* `BENCHMARK_TEMPLATE_DEFINE_F(ClassName, Method, ...)`

For example:

```c++
template<typename T>
class MyFixture : public benchmark::Fixture {};

BENCHMARK_TEMPLATE_F(MyFixture, IntTest, int)(benchmark::State& st) {
  for (auto _ : st) {
    ...
  }
}

BENCHMARK_TEMPLATE_DEFINE_F(MyFixture, DoubleTest, double)(benchmark::State& st) {
  for (auto _ : st) {
    ...
  }
}

BENCHMARK_REGISTER_F(MyFixture, DoubleTest)->Threads(2);
```

<a name="custom-counters" />

## Custom Counters

You can add your own counters with user-defined names. The example below
will add columns "Foo", "Bar" and "Baz" to its output:

```c++
static void UserCountersExample1(benchmark::State& state) {
  double numFoos = 0, numBars = 0, numBazs = 0;
  for (auto _ : state) {
    // ... count Foo,Bar,Baz events
  }
  state.counters["Foo"] = numFoos;
  state.counters["Bar"] = numBars;
  state.counters["Baz"] = numBazs;
}
```

The `state.counters` object is a `std::map` with `std::string` keys
and `Counter` values. The latter is a `double`-like class, via an implicit
conversion to `double&`. Thus you can use all of the standard arithmetic
assignment operators (`=,+=,-=,*=,/=`) to change the value of each counter.

In multithreaded benchmarks, each counter is set on the calling thread only.
When the benchmark finishes, the counters from each thread will be summed;
the resulting sum is the value which will be shown for the benchmark.
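
For example, in the following sketch (`BM_CountItems` is hypothetical), each thread sets its own counter value, and the reported `Items` value is the sum across the 4 threads:

```c++
static void BM_CountItems(benchmark::State& state) {
  double items = 0;
  for (auto _ : state) {
    // ... per-thread work; count what this thread processed ...
    items += 1;
  }
  // Set on the calling thread only; the value shown in the report is
  // the sum of the per-thread values.
  state.counters["Items"] = items;
}
BENCHMARK(BM_CountItems)->Threads(4);
```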

The `Counter` constructor accepts three parameters: the value as a `double`;
a bit flag which allows you to show counters as rates, as per-thread averages,
as per-iteration averages, as iteration invariants, and/or to invert the final
result; and a flag specifying the 'unit', i.e. whether 1k is 1000 (the default,
`benchmark::Counter::OneK::kIs1000`) or 1024
(`benchmark::Counter::OneK::kIs1024`).

```c++
  // sets a simple counter
  state.counters["Foo"] = numFoos;

  // Set the counter as a rate. It will be presented divided
  // by the duration of the benchmark.
  // Meaning: per one second, how many 'foo's are processed?
  state.counters["FooRate"] = Counter(numFoos, benchmark::Counter::kIsRate);

  // Set the counter as a rate. It will be presented divided
  // by the duration of the benchmark, and the result inverted.
  // Meaning: how many seconds it takes to process one 'foo'?
  state.counters["FooInvRate"] = Counter(numFoos, benchmark::Counter::kIsRate | benchmark::Counter::kInvert);

  // Set the counter as a thread-average quantity. It will
  // be presented divided by the number of threads.
  state.counters["FooAvg"] = Counter(numFoos, benchmark::Counter::kAvgThreads);

  // There's also a combined flag:
  state.counters["FooAvgRate"] = Counter(numFoos, benchmark::Counter::kAvgThreadsRate);

  // This says that we process with the rate of state.range(0) bytes every iteration:
  state.counters["BytesProcessed"] = Counter(state.range(0), benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024);
```

When you're compiling in C++11 mode or later you can use `insert()` with
`std::initializer_list`:

<!-- {% raw %} -->
```c++
  // With C++11, this can be done:
  state.counters.insert({{"Foo", numFoos}, {"Bar", numBars}, {"Baz", numBazs}});
  // ... instead of:
  state.counters["Foo"] = numFoos;
  state.counters["Bar"] = numBars;
  state.counters["Baz"] = numBazs;
```
<!-- {% endraw %} -->

### Counter Reporting

When using the console reporter, by default, user counters are printed at
the end after the table, the same way as `bytes_processed` and
`items_processed`. This is best for cases in which there are few counters,
or where there are only a couple of lines per benchmark. Here's an example of
the default output:

```
------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------
BM_UserCounter/threads:8      2248 ns      10277 ns      68808 Bar=16 Bat=40 Baz=24 Foo=8
BM_UserCounter/threads:1      9797 ns       9788 ns      71523 Bar=2 Bat=5 Baz=3 Foo=1024m
BM_UserCounter/threads:2      4924 ns       9842 ns      71036 Bar=4 Bat=10 Baz=6 Foo=2
BM_UserCounter/threads:4      2589 ns      10284 ns      68012 Bar=8 Bat=20 Baz=12 Foo=4
BM_UserCounter/threads:8      2212 ns      10287 ns      68040 Bar=16 Bat=40 Baz=24 Foo=8
BM_UserCounter/threads:16     1782 ns      10278 ns      68144 Bar=32 Bat=80 Baz=48 Foo=16
BM_UserCounter/threads:32     1291 ns      10296 ns      68256 Bar=64 Bat=160 Baz=96 Foo=32
BM_UserCounter/threads:4      2615 ns      10307 ns      68040 Bar=8 Bat=20 Baz=12 Foo=4
BM_Factorial                    26 ns         26 ns   26608979 40320
BM_Factorial/real_time          26 ns         26 ns   26587936 40320
BM_CalculatePiRange/1           16 ns         16 ns   45704255 0
BM_CalculatePiRange/8           73 ns         73 ns    9520927 3.28374
BM_CalculatePiRange/64         609 ns        609 ns    1140647 3.15746
BM_CalculatePiRange/512       4900 ns       4901 ns     142696 3.14355
```

If this doesn't suit you, you can print each counter as a table column by
passing the flag `--benchmark_counters_tabular=true` to the benchmark
application. This is best for cases in which there are a lot of counters, or
a lot of lines per individual benchmark. Note that this will trigger a
reprinting of the table header any time the counter set changes between
individual benchmarks. Here's an example of the corresponding output when
`--benchmark_counters_tabular=true` is passed:

```
---------------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations   Bar   Bat   Baz   Foo
---------------------------------------------------------------------------------------
BM_UserCounter/threads:8      2198 ns       9953 ns      70688    16    40    24     8
BM_UserCounter/threads:1      9504 ns       9504 ns      73787     2     5     3     1
BM_UserCounter/threads:2      4775 ns       9550 ns      72606     4    10     6     2
BM_UserCounter/threads:4      2508 ns       9951 ns      70332     8    20    12     4
BM_UserCounter/threads:8      2055 ns       9933 ns      70344    16    40    24     8
BM_UserCounter/threads:16     1610 ns       9946 ns      70720    32    80    48    16
BM_UserCounter/threads:32     1192 ns       9948 ns      70496    64   160    96    32
BM_UserCounter/threads:4      2506 ns       9949 ns      70332     8    20    12     4
--------------------------------------------------------------
Benchmark                        Time           CPU Iterations
--------------------------------------------------------------
BM_Factorial                    26 ns         26 ns   26392245 40320
BM_Factorial/real_time          26 ns         26 ns   26494107 40320
BM_CalculatePiRange/1           15 ns         15 ns   45571597 0
BM_CalculatePiRange/8           74 ns         74 ns    9450212 3.28374
BM_CalculatePiRange/64         595 ns        595 ns    1173901 3.15746
BM_CalculatePiRange/512       4752 ns       4752 ns     147380 3.14355
BM_CalculatePiRange/4k       37970 ns      37972 ns      18453 3.14184
BM_CalculatePiRange/32k     303733 ns     303744 ns       2305 3.14162
BM_CalculatePiRange/256k   2434095 ns    2434186 ns        288 3.1416
BM_CalculatePiRange/1024k  9721140 ns    9721413 ns         71 3.14159
BM_CalculatePi/threads:8      2255 ns       9943 ns      70936
```

Note above the additional header printed when the benchmark changes from
`BM_UserCounter` to `BM_Factorial`. This is because `BM_Factorial` does
not have the same counter set as `BM_UserCounter`.

<a name="multithreaded-benchmarks"/>

## Multithreaded Benchmarks

In a multithreaded test (benchmark invoked by multiple threads simultaneously),
it is guaranteed that none of the threads will start until all have reached
the start of the benchmark loop, and all will have finished before any thread
exits the benchmark loop. (This behavior is also provided by the `KeepRunning()`
API.) As such, any global setup or teardown can be wrapped in a check against
the thread index:

```c++
static void BM_MultiThreaded(benchmark::State& state) {
  if (state.thread_index() == 0) {
    // Setup code here.
  }
  for (auto _ : state) {
    // Run the test as normal.
  }
  if (state.thread_index() == 0) {
    // Teardown code here.
  }
}
BENCHMARK(BM_MultiThreaded)->Threads(2);
```

To run the benchmark across a range of thread counts, instead of `Threads`, use
`ThreadRange`. This takes two parameters (`min_threads` and `max_threads`) and
runs the benchmark once for values in the inclusive range. For example:

```c++
BENCHMARK(BM_MultiThreaded)->ThreadRange(1, 8);
```

will run `BM_MultiThreaded` with thread counts 1, 2, 4, and 8.

If the benchmarked code itself uses threads and you want to compare it to
single-threaded code, you may want to use real-time ("wallclock") measurements
for latency comparisons:

```c++
BENCHMARK(BM_test)->Range(8, 8<<10)->UseRealTime();
```

Without `UseRealTime`, CPU time is used by default.

<a name="cpu-timers" />

## CPU Timers

By default, the CPU timer only measures the time spent by the main thread.
If the benchmark itself uses threads internally, this measurement may not
be what you are looking for. Instead, there is a way to measure the total
CPU usage of the process, by all the threads.

```c++
void callee(int i);

static void MyMain(int size) {
#pragma omp parallel for
  for(int i = 0; i < size; i++)
    callee(i);
}

static void BM_OpenMP(benchmark::State& state) {
  for (auto _ : state)
    MyMain(state.range(0));
}

// Measure the time spent by the main thread, use it to decide for how long to
// run the benchmark loop. Depending on the internal implementation details,
// this may measure anywhere from near-zero (the overhead spent before/after
// work handoff to worker thread[s]) to the whole single-thread time.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10);

// Measure the user-visible time, the wall clock (literally, the time that
// has passed on the clock on the wall), use it to decide for how long to
// run the benchmark loop. This will always be meaningful, and will match the
// time spent by the main thread in the single-threaded case, in general
// decreasing with the number of internal threads doing the work.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->UseRealTime();

// Measure the total CPU consumption, use it to decide for how long to
// run the benchmark loop. This will always measure to no less than the
// time spent by the main thread in the single-threaded case.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->MeasureProcessCPUTime();

// A mixture of the last two. Measure the total CPU consumption, but use the
// wall clock to decide for how long to run the benchmark loop.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->MeasureProcessCPUTime()->UseRealTime();
```

### Controlling Timers

Normally, the entire duration of the work loop (`for (auto _ : state) {}`)
is measured. But sometimes it is necessary to do some work inside of
that loop, every iteration, without counting that time toward the benchmark
time. That is possible, although it is not recommended, since it has high
overhead.

<!-- {% raw %} -->
```c++
static void BM_SetInsert_With_Timer_Control(benchmark::State& state) {
  std::set<int> data;
  for (auto _ : state) {
    state.PauseTiming(); // Stop timers. They will not count until they are resumed.
    data = ConstructRandomSet(state.range(0)); // Do something that should not be measured.
    state.ResumeTiming(); // And resume timers. They are now counting again.
    // The rest will be measured.
    for (int j = 0; j < state.range(1); ++j)
      data.insert(RandomNumber());
  }
}
BENCHMARK(BM_SetInsert_With_Timer_Control)->Ranges({{1<<10, 8<<10}, {128, 512}});
```
<!-- {% endraw %} -->

<a name="manual-timing" />

## Manual Timing

For benchmarking something for which neither CPU time nor real-time is
correct or accurate enough, completely manual timing is supported using
the `UseManualTime` function.

When `UseManualTime` is used, the benchmarked code must call
`SetIterationTime` once per iteration of the benchmark loop to
report the manually measured time.

An example use case for this is benchmarking GPU execution (e.g. OpenCL
or CUDA kernels, OpenGL or Vulkan or Direct3D draw calls), which cannot
be accurately measured using CPU time or real-time. Instead, they can be
measured accurately using a dedicated API, and these measurement results
can be reported back with `SetIterationTime`.

```c++
static void BM_ManualTiming(benchmark::State& state) {
  int microseconds = state.range(0);
  std::chrono::duration<double, std::micro> sleep_duration {
    static_cast<double>(microseconds)
  };

  for (auto _ : state) {
    auto start = std::chrono::high_resolution_clock::now();
    // Simulate some useful workload with a sleep
    std::this_thread::sleep_for(sleep_duration);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(
            end - start);

    state.SetIterationTime(elapsed_seconds.count());
  }
}
BENCHMARK(BM_ManualTiming)->Range(1, 1<<17)->UseManualTime();
```

<a name="setting-the-time-unit" />

## Setting the Time Unit

If a benchmark runs for a few milliseconds, it may be hard to visually compare
the measured times, since the output data is given in nanoseconds by default.
To set the time unit manually, specify it at registration time:

```c++
BENCHMARK(BM_test)->Unit(benchmark::kMillisecond);
```

Additionally the default time unit can be set globally with the
`--benchmark_time_unit={ns|us|ms|s}` command line argument. The argument only
affects benchmarks where the time unit is not set explicitly.

<a name="preventing-optimization" />

## Preventing Optimization

To prevent a value or expression from being optimized away by the compiler,
the `benchmark::DoNotOptimize(...)` and `benchmark::ClobberMemory()`
functions can be used.

```c++
static void BM_test(benchmark::State& state) {
  for (auto _ : state) {
    int x = 0;
    for (int i=0; i < 64; ++i) {
      benchmark::DoNotOptimize(x += i);
    }
  }
}
```

`DoNotOptimize(<expr>)` forces the *result* of `<expr>` to be stored in either
memory or a register. For GNU based compilers it acts as a read/write barrier
for global memory. More specifically it forces the compiler to flush pending
writes to memory and reload any other values as necessary.

Note that `DoNotOptimize(<expr>)` does not prevent optimizations on `<expr>`
in any way. `<expr>` may even be removed entirely when the result is already
known. For example:

```c++
  /* Example 1: `<expr>` is removed entirely. */
  int foo(int x) { return x + 42; }
  while (...) DoNotOptimize(foo(0)); // Optimized to DoNotOptimize(42);

  /* Example 2: Result of '<expr>' is only reused */
  int bar(int) __attribute__((const));
  while (...) DoNotOptimize(bar(0)); // Optimized to:
  // int __result__ = bar(0);
  // while (...) DoNotOptimize(__result__);
```

The second tool for preventing optimizations is `ClobberMemory()`. In essence
`ClobberMemory()` forces the compiler to perform all pending writes to global
memory. Memory managed by block scope objects must be "escaped" using
`DoNotOptimize(...)` before it can be clobbered. In the example below,
`ClobberMemory()` prevents the call to `v.push_back(42)` from being optimized
away.

```c++
static void BM_vector_push_back(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.reserve(1);
    auto data = v.data();           // Allow v.data() to be clobbered. Pass as non-const
    benchmark::DoNotOptimize(data); // lvalue to avoid undesired compiler optimizations.
    v.push_back(42);
    benchmark::ClobberMemory();     // Force 42 to be written to memory.
  }
}
```

Note that `ClobberMemory()` is only available for GNU or MSVC based compilers.

<a name="reporting-statistics" />

## Statistics: Reporting the Mean, Median and Standard Deviation / Coefficient of Variation of Repeated Benchmarks

By default each benchmark is run once and that single result is reported.
However benchmarks are often noisy and a single result may not be representative
of the overall behavior. For this reason it's possible to repeatedly rerun the
benchmark.

The number of runs of each benchmark is specified globally by the
`--benchmark_repetitions` flag or on a per-benchmark basis by calling
`Repetitions` on the registered benchmark object. When a benchmark is run more
than once, the mean, median, standard deviation and coefficient of variation
of the runs will be reported.
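
For example, the following sketch (`BM_test` is a placeholder benchmark) requests 10 repetitions at registration time, together with the aggregate-only reporting described below:

```c++
// Run the benchmark 10 times and report only the aggregates
// (mean, median, standard deviation, coefficient of variation).
BENCHMARK(BM_test)->Repetitions(10)->ReportAggregatesOnly(true);
```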

Additionally the `--benchmark_report_aggregates_only={true|false}` and
`--benchmark_display_aggregates_only={true|false}` flags, or the
`ReportAggregatesOnly(bool)` and `DisplayAggregatesOnly(bool)` functions, can be
used to change how repeated tests are reported. By default the result of each
repeated run is reported. When the `report aggregates only` option is `true`,
only the aggregates (i.e. mean, median, standard deviation and coefficient
of variation, plus complexity measurements if they were requested) of the runs
are reported, to both reporters: the standard output (console) and the file.
However, when only the `display aggregates only` option is `true`,
only the aggregates are displayed on the standard output, while the file
output still contains everything.
Calling `ReportAggregatesOnly(bool)` / `DisplayAggregatesOnly(bool)` on a
registered benchmark object overrides the value of the appropriate flag for that
benchmark.

<a name="custom-statistics" />

## Custom Statistics

While having these aggregates is nice, this may not be enough for everyone.
For example, you may want to know what the largest observation is, e.g. because
you have some real-time constraints. This is easy. The following code will
specify a custom statistic to be calculated, defined by a lambda function.

```c++
void BM_spin_empty(benchmark::State& state) {
  for (auto _ : state) {
    for (int x = 0; x < state.range(0); ++x) {
      benchmark::DoNotOptimize(x);
    }
  }
}

BENCHMARK(BM_spin_empty)
  ->ComputeStatistics("max", [](const std::vector<double>& v) -> double {
    return *(std::max_element(std::begin(v), std::end(v)));
  })
  ->Arg(512);
```

While usually the statistics produce values in time units,
you can also produce percentages:

```c++
void BM_spin_empty(benchmark::State& state) {
  for (auto _ : state) {
    for (int x = 0; x < state.range(0); ++x) {
      benchmark::DoNotOptimize(x);
    }
  }
}

BENCHMARK(BM_spin_empty)
  ->ComputeStatistics("ratio", [](const std::vector<double>& v) -> double {
    // Ratio of the fastest to the slowest observed run.
    return *std::min_element(std::begin(v), std::end(v)) /
           *std::max_element(std::begin(v), std::end(v));
  }, benchmark::StatisticUnit::kPercentage)
  ->Arg(512);
```

<a name="memory-usage" />

## Memory Usage

It's often useful to also track memory usage for benchmarks, alongside CPU
performance. For this reason, benchmark offers the `RegisterMemoryManager`
method that allows a custom `MemoryManager` to be injected.

If set, the `MemoryManager::Start` and `MemoryManager::Stop` methods will be
called at the start and end of benchmark runs to allow user code to fill out
a report on the number of allocations, bytes used, etc.

This data will then be reported alongside other performance data, currently
only when using JSON output.
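
For example, here is a minimal sketch of a custom `MemoryManager`, assuming the `Start()`/`Stop(Result&)` interface of recent library versions; the allocation tracking itself is left as a stub:

```c++
#include <benchmark/benchmark.h>

class MyMemoryManager : public benchmark::MemoryManager {
 public:
  void Start() override {
    // Reset whatever allocation tracking your code provides.
  }
  void Stop(Result& result) override {
    // Fill in the report from your own tracking (stubbed out here).
    result.num_allocs = 0;      // allocations observed since Start()
    result.max_bytes_used = 0;  // peak bytes in use since Start()
  }
};

int main(int argc, char** argv) {
  MyMemoryManager mm;
  benchmark::RegisterMemoryManager(&mm);
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  benchmark::Shutdown();
}
```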

<a name="using-register-benchmark" />

## Using RegisterBenchmark(name, fn, args...)

The `RegisterBenchmark(name, func, args...)` function provides an alternative
way to create and register benchmarks.
`RegisterBenchmark(name, func, args...)` creates, registers, and returns a
pointer to a new benchmark with the specified `name` that invokes
`func(st, args...)` where `st` is a `benchmark::State` object.

Unlike the `BENCHMARK` registration macros, which can only be used at global
scope, `RegisterBenchmark` can be called anywhere. This allows for
benchmark tests to be registered programmatically.

Additionally `RegisterBenchmark` allows any callable object to be registered
as a benchmark, including capturing lambdas and function objects.

For example:

```c++
auto BM_test = [](benchmark::State& st, auto Inputs) { /* ... */ };

int main(int argc, char** argv) {
  for (auto& test_input : { /* ... */ })
    benchmark::RegisterBenchmark(test_input.name(), BM_test, test_input);
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  benchmark::Shutdown();
}
```

<a name="exiting-with-an-error" />

## Exiting with an Error

When errors caused by external influences, such as file I/O and network
communication, occur within a benchmark, the
`State::SkipWithError(const std::string& msg)` function can be used to skip that
run of the benchmark and report the error. Note that only future iterations of
the `KeepRunning()` loop are skipped. For the ranged-for version of the
benchmark loop, users must explicitly exit the loop, otherwise all iterations
will be performed. Users may explicitly return to exit the benchmark
immediately.

The `SkipWithError(...)` function may be used at any point within the benchmark,
including before and after the benchmark loop. Moreover, if `SkipWithError(...)`
has been used, it is not required to reach the benchmark loop and one may return
from the benchmark function early.

For example:

```c++
static void BM_test(benchmark::State& state) {
  auto resource = GetResource();
  if (!resource.good()) {
    state.SkipWithError("Resource is not good!");
    // KeepRunning() loop will not be entered.
  }
  while (state.KeepRunning()) {
    auto data = resource.read_data();
    if (!resource.good()) {
      state.SkipWithError("Failed to read data!");
      break; // Needed to skip the rest of the iteration.
    }
    do_stuff(data);
  }
}

static void BM_test_ranged_for(benchmark::State& state) {
  auto resource = GetResource();
  if (!resource.good()) {
    state.SkipWithError("Resource is not good!");
    return; // Early return is allowed when SkipWithError() has been used.
  }
  for (auto _ : state) {
    auto data = resource.read_data();
    if (!resource.good()) {
      state.SkipWithError("Failed to read data!");
      break; // REQUIRED to prevent all further iterations.
    }
    do_stuff(data);
  }
}
```

<a name="a-faster-keep-running-loop" />

## A Faster KeepRunning Loop

In C++11 mode, a range-based for loop should be used in preference to
the `KeepRunning` loop for running the benchmarks. For example:

```c++
static void BM_Fast(benchmark::State &state) {
  for (auto _ : state) {
    FastOperation();
  }
}
BENCHMARK(BM_Fast);
```

The reason the range-based for loop is faster than using `KeepRunning` is
that `KeepRunning` requires a memory load and store of the iteration count
every iteration, whereas the range-based variant is able to keep the iteration
count in a register.

For example, an empty inner loop using the range-based for method looks like:

```asm
# Loop Init
  mov rbx, qword ptr [r14 + 104]
  call benchmark::State::StartKeepRunning()
  test rbx, rbx
  je .LoopEnd
.LoopHeader: # =>This Inner Loop Header: Depth=1
  add rbx, -1
  jne .LoopHeader
.LoopEnd:
```

Compared to an empty `KeepRunning` loop, which looks like:

```asm
.LoopHeader: # in Loop: Header=BB0_3 Depth=1
  cmp byte ptr [rbx], 1
  jne .LoopInit
.LoopBody: # =>This Inner Loop Header: Depth=1
  mov rax, qword ptr [rbx + 8]
  lea rcx, [rax + 1]
  mov qword ptr [rbx + 8], rcx
  cmp rax, qword ptr [rbx + 104]
  jb .LoopHeader
  jmp .LoopEnd
.LoopInit:
  mov rdi, rbx
  call benchmark::State::StartKeepRunning()
  jmp .LoopBody
.LoopEnd:
```

Unless C++03 compatibility is required, the range-based variant of writing
the benchmark loop should be preferred.

<a name="disabling-cpu-frequency-scaling" />

## Disabling CPU Frequency Scaling

If you see this error:

```
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
```

you might want to disable the CPU frequency scaling while running the
benchmark, as well as consider other ways to stabilize the performance of
your system while benchmarking.

See [Reducing Variance](reducing_variance.md) for more information.