# Benchmark Tools

## compare.py

The `compare.py` script can be used to compare the results of benchmarks.

### Dependencies
The utility relies on the [scipy](https://www.scipy.org) package, which can be installed using pip:
```bash
pip3 install -r requirements.txt
```

### Displaying aggregates only

The switch `-a` / `--display_aggregates_only` can be used to control the
display of the normal iterations vs. the aggregates. When passed, it is
forwarded to the benchmark binaries being run, and is also accounted for
in the tool itself; only the aggregates are displayed, but not the normal runs.
It only affects the display; the separate runs are still used to calculate
the U test.

### Modes of operation

There are three modes of operation:

1. Just compare two benchmarks
The program is invoked like:

``` bash
$ compare.py benchmarks <benchmark_baseline> <benchmark_contender> [benchmark options]...
```
Where `<benchmark_baseline>` and `<benchmark_contender>` each specify either a benchmark executable file or a JSON output file. The type of the input file is detected automatically. If a benchmark executable is specified, the benchmark is run to obtain the results.
Otherwise the results are simply loaded from the output file.

`[benchmark options]` will be passed to the benchmark invocations. They can be anything the binary accepts, be it normal `--benchmark_*` parameters or custom parameters your binary takes.

Example output:
```
$ ./compare.py benchmarks ./a.out ./a.out
RUNNING: ./a.out --benchmark_out=/tmp/tmprBT5nW
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:16:44
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   19101577   211.669MB/s
BM_memcpy/64           76 ns         76 ns    9412571   800.199MB/s
BM_memcpy/512          84 ns         84 ns    8249070   5.64771GB/s
BM_memcpy/1024        116 ns        116 ns    6181763   8.19505GB/s
BM_memcpy/8192        643 ns        643 ns    1062855   11.8636GB/s
BM_copy/8             222 ns        222 ns    3137987   34.3772MB/s
BM_copy/64           1608 ns       1608 ns     432758   37.9501MB/s
BM_copy/512         12589 ns      12589 ns      54806   38.7867MB/s
BM_copy/1024        25169 ns      25169 ns      27713   38.8003MB/s
BM_copy/8192       201165 ns     201112 ns       3486   38.8466MB/s
RUNNING: ./a.out --benchmark_out=/tmp/tmpt1wwG_
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:16:53
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   19397903   211.255MB/s
BM_memcpy/64           73 ns         73 ns    9691174   839.635MB/s
BM_memcpy/512          85 ns         85 ns    8312329   5.60101GB/s
BM_memcpy/1024        118 ns        118 ns    6438774   8.11608GB/s
BM_memcpy/8192        656 ns        656 ns    1068644   11.6277GB/s
BM_copy/8             223 ns        223 ns    3146977   34.2338MB/s
BM_copy/64           1611 ns       1611 ns     435340   37.8751MB/s
BM_copy/512         12622 ns      12622 ns      54818   38.6844MB/s
BM_copy/1024        25257 ns      25239 ns      27779   38.6927MB/s
BM_copy/8192       205013 ns     205010 ns       3479    38.108MB/s
Comparing ./a.out to ./a.out
Benchmark                 Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------
BM_memcpy/8            +0.0020         +0.0020            36            36            36            36
BM_memcpy/64           -0.0468         -0.0470            76            73            76            73
BM_memcpy/512          +0.0081         +0.0083            84            85            84            85
BM_memcpy/1024         +0.0098         +0.0097           116           118           116           118
BM_memcpy/8192         +0.0200         +0.0203           643           656           643           656
BM_copy/8              +0.0046         +0.0042           222           223           222           223
BM_copy/64             +0.0020         +0.0020          1608          1611          1608          1611
BM_copy/512            +0.0027         +0.0026         12589         12622         12589         12622
BM_copy/1024           +0.0035         +0.0028         25169         25257         25169         25239
BM_copy/8192           +0.0191         +0.0194        201165        205013        201112        205010
```

For every benchmark from the first run, it looks for the benchmark with exactly the same name in the second run, and then compares the results. If the names differ, the benchmark is omitted from the diff.
As you can note, the values in the `Time` and `CPU` columns are calculated as `(new - old) / |old|`.

2. Compare two different filters of one benchmark
The program is invoked like:

``` bash
$ compare.py filters <benchmark> <filter_baseline> <filter_contender> [benchmark options]...
```
Where `<benchmark>` specifies either a benchmark executable file or a JSON output file. The type of the input file is detected automatically. If a benchmark executable is specified, the benchmark is run to obtain the results. Otherwise the results are simply loaded from the output file.

Where `<filter_baseline>` and `<filter_contender>` are the same regex filters that you would pass to the `[--benchmark_filter=<regex>]` parameter of the benchmark binary.

`[benchmark options]` will be passed to the benchmark invocations. They can be anything the binary accepts, be it normal `--benchmark_*` parameters or custom parameters your binary takes.

Example output:
```
$ ./compare.py filters ./a.out BM_memcpy BM_copy
RUNNING: ./a.out --benchmark_filter=BM_memcpy --benchmark_out=/tmp/tmpBWKk0k
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:37:28
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   17891491   211.215MB/s
BM_memcpy/64           74 ns         74 ns    9400999   825.646MB/s
BM_memcpy/512          87 ns         87 ns    8027453   5.46126GB/s
BM_memcpy/1024        111 ns        111 ns    6116853    8.5648GB/s
BM_memcpy/8192        657 ns        656 ns    1064679   11.6247GB/s
RUNNING: ./a.out --benchmark_filter=BM_copy --benchmark_out=/tmp/tmpAvWcOM
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:37:33
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_copy/8            227 ns        227 ns    3038700   33.6264MB/s
BM_copy/64          1640 ns       1640 ns     426893   37.2154MB/s
BM_copy/512        12804 ns      12801 ns      55417   38.1444MB/s
BM_copy/1024       25409 ns      25407 ns      27516   38.4365MB/s
BM_copy/8192      202986 ns     202990 ns       3454   38.4871MB/s
Comparing BM_memcpy to BM_copy (from ./a.out)
Benchmark                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------
[BM_memcpy vs. BM_copy]/8            +5.2829         +5.2812            36           227            36           227
[BM_memcpy vs. BM_copy]/64          +21.1719        +21.1856            74          1640            74          1640
[BM_memcpy vs. BM_copy]/512        +145.6487       +145.6097            87         12804            87         12801
[BM_memcpy vs. BM_copy]/1024       +227.1860       +227.1776           111         25409           111         25407
[BM_memcpy vs. BM_copy]/8192       +308.1664       +308.2898           657        202986           656        202990
```

As you can see, it applies the filter to the benchmarks, both when running the benchmark and before doing the diff. To make the diff work, the matches are replaced with a common string; thus, you can compare two different benchmark families within one benchmark binary.
As you can note, the values in the `Time` and `CPU` columns are calculated as `(new - old) / |old|`.

3. Compare filter one from benchmark one to filter two from benchmark two:
The program is invoked like:

``` bash
$ compare.py benchmarksfiltered <benchmark_baseline> <filter_baseline> <benchmark_contender> <filter_contender> [benchmark options]...
```

Where `<benchmark_baseline>` and `<benchmark_contender>` each specify either a benchmark executable file or a JSON output file. The type of the input file is detected automatically. If a benchmark executable is specified, the benchmark is run to obtain the results. Otherwise the results are simply loaded from the output file.

Where `<filter_baseline>` and `<filter_contender>` are the same regex filters that you would pass to the `[--benchmark_filter=<regex>]` parameter of the benchmark binary.

`[benchmark options]` will be passed to the benchmark invocations. They can be anything the binary accepts, be it normal `--benchmark_*` parameters or custom parameters your binary takes.

Example output:
```
$ ./compare.py benchmarksfiltered ./a.out BM_memcpy ./a.out BM_copy
RUNNING: ./a.out --benchmark_filter=BM_memcpy --benchmark_out=/tmp/tmp_FvbYg
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:38:27
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            37 ns         37 ns   18953482   204.118MB/s
BM_memcpy/64           74 ns         74 ns    9206578   828.245MB/s
BM_memcpy/512          91 ns         91 ns    8086195   5.25476GB/s
BM_memcpy/1024        120 ns        120 ns    5804513   7.95662GB/s
BM_memcpy/8192        664 ns        664 ns    1028363   11.4948GB/s
RUNNING: ./a.out --benchmark_filter=BM_copy --benchmark_out=/tmp/tmpDfL5iE
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:38:32
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_copy/8            230 ns        230 ns    2985909   33.1161MB/s
BM_copy/64          1654 ns       1653 ns     419408   36.9137MB/s
BM_copy/512        13122 ns      13120 ns      53403   37.2156MB/s
BM_copy/1024       26679 ns      26666 ns      26575   36.6218MB/s
BM_copy/8192      215068 ns     215053 ns       3221   36.3283MB/s
Comparing BM_memcpy (from ./a.out) to BM_copy (from ./a.out)
Benchmark                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------
[BM_memcpy vs. BM_copy]/8            +5.1649         +5.1637            37           230            37           230
[BM_memcpy vs. BM_copy]/64          +21.4352        +21.4374            74          1654            74          1653
[BM_memcpy vs. BM_copy]/512        +143.6022       +143.5865            91         13122            91         13120
[BM_memcpy vs. BM_copy]/1024       +221.5903       +221.4790           120         26679           120         26666
[BM_memcpy vs. BM_copy]/8192       +322.9059       +323.0096           664        215068           664        215053
```
This is a mix of the previous two modes: two (potentially different) benchmark binaries are run, and a different filter is applied to each one.
As you can note, the values in the `Time` and `CPU` columns are calculated as `(new - old) / |old|`.

### Note: Interpreting the output

Performance measurements are an art, and performance comparisons are doubly so.
Results are often noisy and don't necessarily show large absolute differences,
so just by visual inspection it is not at all apparent whether two
measurements are actually showing a performance change. It is even more
confusing with multiple benchmark repetitions.

Thankfully, what we can do is use a statistical test on the results to determine
whether the performance has changed in a statistically significant way. `compare.py`
uses the [Mann–Whitney U
test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test), with the null
hypothesis being that there is no difference in performance.

**The output below is a summary of a benchmark comparison with statistics
provided for a multi-threaded process.**
```
Benchmark                                              Time       CPU    Time Old    Time New    CPU Old    CPU New
-----------------------------------------------------------------------------------------------------------------------------
benchmark/threads:1/process_time/real_time_pvalue    0.0000    0.0000      U Test, Repetitions: 27 vs 27
benchmark/threads:1/process_time/real_time_mean     -0.1442   -0.1442          90          77         90         77
benchmark/threads:1/process_time/real_time_median   -0.1444   -0.1444          90          77         90         77
benchmark/threads:1/process_time/real_time_stddev   +0.3974   +0.3933           0           0          0          0
benchmark/threads:1/process_time/real_time_cv       +0.6329   +0.6280           0           0          0          0
OVERALL_GEOMEAN                                     -0.1442   -0.1442           0           0          0          0
```
--------------------------------------------
Here's a breakdown of each row:

**benchmark/threads:1/process_time/real_time_pvalue**: This shows the _p-value_ for
the statistical test comparing the performance of the process running with one
thread.
A value of 0.0000 suggests a statistically significant difference in
performance. The comparison was conducted using the U Test (Mann-Whitney
U Test) with 27 repetitions for each case.

**benchmark/threads:1/process_time/real_time_mean**: This shows the relative
difference in mean execution time between the two cases. The negative
value (-0.1442) implies that the new process is faster by about 14.42%. The old
time was 90 units, while the new time is 77 units.

**benchmark/threads:1/process_time/real_time_median**: Similarly, this shows the
relative difference in the median execution time. Again, the new process is
faster, by 14.44%.

**benchmark/threads:1/process_time/real_time_stddev**: This is the relative
difference in the standard deviation of the execution time, which is a measure
of how much variation or dispersion there is from the mean. A positive value
(+0.3974) implies there is more variance in the execution time of the new
process.

**benchmark/threads:1/process_time/real_time_cv**: CV stands for Coefficient of
Variation. It is the ratio of the standard deviation to the mean, and provides a
standardized measure of dispersion. An increase (+0.6329) indicates more
relative variability in the new process.

**OVERALL_GEOMEAN**: Geomean stands for geometric mean, a type of average that is
less influenced by outliers.
The negative value indicates a general improvement
in the new process. However, given that the old and new times in this row are
all shown as zero, those columns appear to be placeholders rather than
meaningful values here.

-----------------------------------------

Let's first see what the different columns represent in the above
`compare.py` benchmarking output:

 1. **Benchmark:** The name of the function being benchmarked, along with the
    size of the input (after the slash).

 2. **Time:** The average time per operation, across all iterations.

 3. **CPU:** The average CPU time per operation, across all iterations.

 4. **Iterations:** The number of iterations the benchmark was run to get a
    stable estimate.

 5. **Time Old and Time New:** These represent the average time it takes for a
    function to run in two different scenarios or versions. For example, you
    might be comparing how fast a function runs before and after you make some
    changes to it.

 6. **CPU Old and CPU New:** These show the average amount of CPU time that the
    function uses in two different scenarios or versions. This is similar to
    Time Old and Time New, but focuses on CPU usage instead of overall time.
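The relative differences shown in the `Time` and `CPU` columns are easy to check by hand with the `(new - old) / |old|` formula; a minimal sketch (the `relative_difference` helper is purely illustrative), using the rounded mean times from the summary above:

```python
def relative_difference(old: float, new: float) -> float:
    """Relative change as reported in the Time/CPU columns: (new - old) / |old|."""
    return (new - old) / abs(old)

# Rounded mean times from the summary above: 90 time units old, 77 new.
# Note: the mean row prints -0.1442 because it is computed from the
# unrounded means; the rounded inputs here give -0.1444 instead.
print(f"{relative_difference(90, 77):+.4f}")  # → -0.1444
```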

In the comparison section, the relative differences in both time and CPU time
are displayed for each input size.


A statistically significant difference is determined by a **p-value**, which is
a measure of the probability that the observed difference could have occurred
just by random chance. A smaller p-value indicates stronger evidence against the
null hypothesis.

**Therefore:**
 1. If the p-value is less than the chosen significance level (alpha), we
    reject the null hypothesis and conclude the benchmarks are significantly
    different.
 2. If the p-value is greater than or equal to alpha, we fail to reject the
    null hypothesis and treat the two benchmarks as similar.


The result of the statistical test is additionally communicated through color coding:
```diff
+ Green:
```
   The benchmarks are _**statistically different**_. This could mean the
   performance has either **significantly improved** or **significantly
   deteriorated**. You should look at the actual performance numbers to see which
   is the case.
```diff
- Red:
```
   The benchmarks are _**statistically similar**_. This means the performance
   **hasn't significantly changed**.
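The two numbered rules above boil down to a single comparison; a minimal sketch, where the `is_significant` helper name and the 0.05 alpha are illustrative assumptions, not names taken from `compare.py`:

```python
ALPHA = 0.05  # illustrative significance level; pick one before measuring

def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Rule 1: p < alpha  -> reject the null hypothesis (benchmarks differ).
    Rule 2: p >= alpha -> fail to reject it (treat the benchmarks as similar)."""
    return p_value < alpha

print(is_significant(0.0000))  # → True: statistically different ("green")
print(is_significant(0.3000))  # → False: treated as similar ("red")
```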

In statistical terms, **'green'** means we reject the null hypothesis that
there's no difference in performance, and **'red'** means we fail to reject the
null hypothesis. This might seem counter-intuitive if you're expecting 'green'
to mean 'improved performance' and 'red' to mean 'worsened performance'.
But remember, in this context:
```
'Success' means 'successfully finding a difference'.
'Failure' means 'failing to find a difference'.
```

Also, please note that **even if** we determine that there **is** a
statistically significant difference between the two measurements, it does not
_necessarily_ mean that the actual benchmarks that were measured **are**
different; and vice versa, even if we determine that there is **no**
statistically significant difference between the two measurements, it does not
necessarily mean that the actual benchmarks that were measured **are not**
different.

### U test

If there is a sufficient repetition count of the benchmarks, the tool can do
a [U Test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) of the
null hypothesis that it is equally likely that a randomly selected value from
one sample will be less than or greater than a randomly selected value from a
second sample.

If the calculated p-value is lower than the significance level alpha, then the
result is said to be statistically significant and the null hypothesis is
rejected; in other words, the two benchmarks aren't identical.

**WARNING**: this requires a **large** number of repetitions (no less than 9)
to be meaningful!
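Since the tool already depends on scipy, the same U test can be reproduced directly; a sketch with made-up per-repetition timings, using nine repetitions per side as the warning above recommends (the sample values and variable names are illustrative, and this is not `compare.py`'s actual code):

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-repetition real times (ns) for baseline and contender;
# nine repetitions each, the minimum suggested above.
old_times = [90, 91, 89, 92, 90, 88, 91, 90, 89]
new_times = [77, 78, 76, 79, 77, 75, 78, 77, 76]

# Two-sided U test: the null hypothesis is that a value drawn from one
# sample is equally likely to be less than or greater than a value drawn
# from the other.
_, p_value = mannwhitneyu(old_times, new_times, alternative="two-sided")

alpha = 0.05
if p_value < alpha:
    print(f"p={p_value:.5f}: statistically significant difference")
else:
    print(f"p={p_value:.5f}: no statistically significant difference found")
```

With fully separated samples like these, the test reports a very small p-value, mirroring the `real_time_pvalue` row shown earlier.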