Revision tags: llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4 |
|
#
ee09f7d1 |
| 26-Aug-2024 |
Amir Ayupov <aaupov@fb.com> |
[MC][NFC] Reduce Address2ProbesMap size
Replace the map from addresses to list of probes with a flat vector containing probe references sorted by their addresses.
Reduces pseudo probe parsing time
[MC][NFC] Reduce Address2ProbesMap size
Replace the map from addresses to list of probes with a flat vector containing probe references sorted by their addresses.
Reduces pseudo probe parsing time from 9.56s to 8.59s and peak RSS from 9.66 GiB to 9.08 GiB as part of perf2bolt processing a large binary.
Test Plan: ``` bin/llvm-lit -sv test/tools/llvm-profgen ```
Reviewers: maksfb, rafaelauler, dcci, ayermolo, wlei-llvm
Reviewed By: wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/102904
show more ...
|
#
04ebd190 |
| 26-Aug-2024 |
Amir Ayupov <aaupov@fb.com> |
[MC][NFC] Statically allocate storage for decoded pseudo probes and function records
Use #102774 to allocate storage for decoded probes (`PseudoProbeVec`) and function records (`InlineTreeVec`).
Le
[MC][NFC] Statically allocate storage for decoded pseudo probes and function records
Use #102774 to allocate storage for decoded probes (`PseudoProbeVec`) and function records (`InlineTreeVec`).
Leverage that to also shrink sizes of `MCDecodedPseudoProbe`: - Drop Guid since it's accessible via `InlineTree`.
`MCDecodedPseudoProbeInlineTree`: - Keep track of probes and inlinees using `ArrayRef`s now that probes and function records belonging to the same function are allocated contiguously.
This reduces peak RSS from 13.7 GiB to 9.7 GiB and pseudo probe parsing time (as part of perf2bolt) from 15.3s to 9.6s for a large binary with 400MiB .pseudo_probe section containing 43M probes and 25M function records.
Depends on: #102774 #102787 #102788
Reviewers: maksfb, rafaelauler, dcci, ayermolo, wlei-llvm
Reviewed By: wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/102789
show more ...
|
Revision tags: llvmorg-19.1.0-rc3 |
|
#
cd15d12f |
| 11-Aug-2024 |
Amir Ayupov <aaupov@fb.com> |
[MC][profgen][NFC] Expand auto for MCDecodedPseudoProbe
Expand autos in select places in preparation to #102789.
Reviewers: dcci, maksfb, WenleiHe, rafaelauler, ayermolo, wlei-llvm
Reviewed By: We
[MC][profgen][NFC] Expand auto for MCDecodedPseudoProbe
Expand autos in select places in preparation to #102789.
Reviewers: dcci, maksfb, WenleiHe, rafaelauler, ayermolo, wlei-llvm
Reviewed By: WenleiHe, wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/102788
show more ...
|
Revision tags: llvmorg-19.1.0-rc2 |
|
#
23609a38 |
| 02-Aug-2024 |
Tim Creech <timothy.m.creech@intel.com> |
[llvm-profgen] Revert #99826 and #99026 (#100147)
Revert #99826 and #99026 to allow for additional input.
|
Revision tags: llvmorg-19.1.0-rc1, llvmorg-20-init |
|
#
0caf0c93 |
| 21-Jul-2024 |
Tim Creech <timothy.m.creech@intel.com> |
[llvm-profgen] Support creating profiles of arbitrary events (#99026)
This change introduces two options which may be used to create profiles
of arbitrary PMU events.
1. `--leading-ip-only` prov
[llvm-profgen] Support creating profiles of arbitrary events (#99026)
This change introduces two options which may be used to create profiles
of arbitrary PMU events.
1. `--leading-ip-only` provides a simple sample-IP-based profile mode.
This is not useful for building a profile of execution frequency, but it
is useful for building new types of profiles.
For example, to build a profile of unpredictable branches:
perf record -b -e branch-misses:upp -o perf.data ... llvm-profgen
--perfdata perf.data --leading-ip-only ...
2. `--perf-event=event` enables the creation of a profile concerned with
a specific event or set of events. The names given should match the
"event" field as emitted by perf-script(1).
This option has two spellings: `--perf-event` and `--perf-events`. The
plural spelling accepts a comma-separated list. The singular spelling
appends a single event name to the set of events which will be used.
This is meant to accommodate event names containing commas.
Combined, these options allow generating multiple kinds of profiles from
a single `perf record` collection. For example, to generate both
execution frequency and branch mispredict profiles:
perf record -c 1000003 -b -e
br_inst_retired.near_taken:upp,br_misp_retired.all_branches:upp ...
llvm-profgen --output execution.prof
--perf-event=br_inst_retired.near_taken:upp ...
llvm-profgen --leading-ip-only --output unpredictable.prof
--perf-event=br_misp_retired.all_branches:upp ...
These additions are in support of more general HWPGO[^1], allowing
feedback from a wider range of hardware events.
[^1]:
https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf
---------
Co-authored-by: Tim Creech <tcreech@tcreech.com>
show more ...
|
#
ce03155a |
| 09-Jul-2024 |
Mircea Trofin <mtrofin@google.com> |
[NFC] Coding style fixes: SampleProf (#98208)
Also some control flow simplifications.
Notably, this doesn't address `sampleprof_error`. I *think* the style
there tries to match `std::error_categ
[NFC] Coding style fixes: SampleProf (#98208)
Also some control flow simplifications.
Notably, this doesn't address `sampleprof_error`. I *think* the style
there tries to match `std::error_category`.
Also left `hash_value` as-is, because it matches what we do in Hashing.h
show more ...
|
Revision tags: llvmorg-18.1.8, llvmorg-18.1.7 |
|
#
b9d40a7a |
| 24-May-2024 |
Lei Wang <wlei@fb.com> |
[llvm-profgen] Improve sample profile density (#92144)
The profile density feature(the amount of samples in the profile
relative to the program size) is used to identify insufficient sample
issue
[llvm-profgen] Improve sample profile density (#92144)
The profile density feature(the amount of samples in the profile
relative to the program size) is used to identify insufficient sample
issue and provide hints for user to increase sample count. A low-density
profile can be inaccurate due to statistical noise, which can hurt FDO
performance.
This change introduces two improvements to the current density work.
1. The density calculation/definition is changed. Previously, the
density of a profile was calculated as the minimum density for all warm
functions (a function was considered warm if its total samples were
within the top N percent of the profile). However, there is a problem
that a high total sample profile can have a very low density, which
makes the density value unstable.
- Instead, we want to find a density number such that if a function's
density is below this value, it is considered low-density function. We
consider the whole profile is bad if a group of low-density functions
have the sum of samples that exceeds N percent cut-off of the total
samples.
- In implementation, we sort the function profiles by density, iterate
them in descending order and keep accumulating the body samples until
the sum exceeds the (100% - N) percentage of the total_samples, the
profile-density is the last(minimum) function-density of processed
functions. We introduce the a flag(`--profile-density-threshold`) for
this percentage threshold.
2. The density is now calculated based on final(compiler used) profiles
instead of merged context-less profiles.
show more ...
|
Revision tags: llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3 |
|
#
24f02517 |
| 16-Feb-2024 |
Lei Wang <wlei@fb.com> |
[llvm-profgen] Filter out ambiguous cold profiles during profile generation (#81803)
For the built-in local initialization function(`__cxx_global_var_init`,
`__tls_init` prefix), there could be mul
[llvm-profgen] Filter out ambiguous cold profiles during profile generation (#81803)
For the built-in local initialization function(`__cxx_global_var_init`,
`__tls_init` prefix), there could be multiple versions of the functions
in the final binary, e.g. `__cxx_global_var_init`, which is a wrapper of
global variable ctors, the compiler could assign suffixes like
`__cxx_global_var_init.N` for different ctors.
However, in the profile generation, we call `getCanonicalFnName` to
canonicalize the names which strip the suffixes. Therefore, samples from
different functions queries the same profile(only
`__cxx_global_var_init`) and the counts are merged. As the functions are
essentially different, entries of the merged profile are ambiguous. In
sample loading, for each version of this function, the IR from one
version would be attributed towards a merged entries, which is
inaccurate, especially for fuzzy profile matching, it gets multiple
callsites(from different function) but using to match one callsite,
which mislead the matching and report a lot of false positives.
Hence, we want to filter them out from the profile map during the
profile generation time. The profiles are all cold functions, it won't
have perf impact.
show more ...
|
Revision tags: llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4 |
|
#
ef0e0adc |
| 17-Oct-2023 |
William Junda Huang <williamjhuang@google.com> |
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.o
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In previous implementation, when a MD5 Sample Profile is read, the
reader first converts the MD5 values to strings, and then create a
StringRef as if the numerical strings are regular function names, and
later on IPO transformation passes perform string comparison over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but it can represent a hash value directly without any
conversion, and it will be more efficient (I will attach some benchmark
results later) when being used in associative containers.
ProfileFuncRef guarantees the same function name in string form or in
MD5 form has the same hash value, which also fix a few issue in IPO
passes where function matching/lookup only check for function name
string, while returns a no-match if the profile is MD5.
When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec in average, and reading function offset table from 0.78s to 0.7s
show more ...
|
Revision tags: llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2 |
|
#
7624de5b |
| 01-Aug-2023 |
William Huang <williamjhuang@google.com> |
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader.
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact: When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
show more ...
|
Revision tags: llvmorg-17.0.0-rc1 |
|
#
1a53b5c3 |
| 28-Jul-2023 |
Aaron Ballman <aaron@aaronballman.com> |
Revert "[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map"
This reverts commit 66ba71d913df7f7cd75e92c0c4265932b7c93292.
Addressin
Revert "[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map"
This reverts commit 66ba71d913df7f7cd75e92c0c4265932b7c93292.
Addressing issues found by: https://lab.llvm.org/buildbot/#/builders/245/builds/11732 https://lab.llvm.org/buildbot/#/builders/187/builds/12251 https://lab.llvm.org/buildbot/#/builders/186/builds/11099 https://lab.llvm.org/buildbot/#/builders/182/builds/6976
show more ...
|
Revision tags: llvmorg-18-init |
|
#
66ba71d9 |
| 12-Jul-2023 |
William Huang <williamjhuang@google.com> |
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader.
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact: When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
show more ...
|
#
58056ae2 |
| 27-Jun-2023 |
Haojian Wu <hokein.wu@gmail.com> |
Revert "[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map"
This reverts commit 12e9c7aaa66b7624b5d7666ce2794d912bf9e4b7.
The commi
Revert "[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map"
This reverts commit 12e9c7aaa66b7624b5d7666ce2794d912bf9e4b7.
The commit has broken the buildbot, see comment https://reviews.llvm.org/D147740#4451540
show more ...
|
#
12e9c7aa |
| 24-Jun-2023 |
William Huang <williamjhuang@google.com> |
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader.
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact: When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
show more ...
|
#
1a0d23ef |
| 25-Jun-2023 |
Wenlei He <aktoon@gmail.com> |
[NFC] Generalize llvm-profgen message to cover both AutoFDO and CSSPGO
Update llvm-profgen profile density message to cover both AutoFDO and CSSPGO.
Differential Revision: https://reviews.llvm.org/
[NFC] Generalize llvm-profgen message to cover both AutoFDO and CSSPGO
Update llvm-profgen profile density message to cover both AutoFDO and CSSPGO.
Differential Revision: https://reviews.llvm.org/D153730
show more ...
|
#
c9a8a0e8 |
| 24-Jun-2023 |
Douglas Yung <douglas.yung@sony.com> |
Revert "[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map"
This reverts commit 31af18bccea95fe1ae8aa2c51cf7c8e92a1c208e.
This chan
Revert "[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map"
This reverts commit 31af18bccea95fe1ae8aa2c51cf7c8e92a1c208e.
This change is causing build failures on many Windows build bots:
https://lab.llvm.org/buildbot/#/builders/216/builds/22833 https://lab.llvm.org/buildbot/#/builders/123/builds/19602 https://lab.llvm.org/buildbot/#/builders/172/builds/28315 https://lab.llvm.org/buildbot/#/builders/119/builds/13870 https://lab.llvm.org/buildbot/#/builders/233/builds/794 https://lab.llvm.org/buildbot/#/builders/235/builds/387 https://lab.llvm.org/buildbot/#/builders/13/builds/36921 https://lab.llvm.org/buildbot/#/builders/127/builds/50510
show more ...
|
Revision tags: llvmorg-16.0.6, llvmorg-16.0.5 |
|
#
31af18bc |
| 25-May-2023 |
William Huang <williamjhuang@google.com> |
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader.
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact: When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
show more ...
|
Revision tags: llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2 |
|
#
345fd0c1 |
| 10-Apr-2023 |
Hongtao Yu <hoy@fb.com> |
[FS-AFDO] Generate pseudo-probe-based profiles with FS-discriminators.
This change enables generating pseudo-probe-based FS-AFDO profiles. The change is straightforward based-on previous change {D14
[FS-AFDO] Generate pseudo-probe-based profiles with FS-discriminators.
This change enables generating pseudo-probe-based FS-AFDO profiles. The change is straightforward based-on previous change {D147651} by just injecting FS-discriminators into various profile generation spot.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D147957
show more ...
|
#
d38d6ca1 |
| 29-Apr-2023 |
William Huang <williamjhuang@google.com> |
[llvm-profdata] Deprecate Compact Binary Sample Profile Format
Remove support for compact binary sample profile format
Reviewed By: davidxl, wenlei
Differential Revision: https://reviews.llvm.org/
[llvm-profdata] Deprecate Compact Binary Sample Profile Format
Remove support for compact binary sample profile format
Reviewed By: davidxl, wenlei
Differential Revision: https://reviews.llvm.org/D149400
show more ...
|
Revision tags: llvmorg-16.0.1 |
|
#
339b8a00 |
| 20-Mar-2023 |
wlei <wlei@fb.com> |
[AutoFDO] Use flattened profiles for profile staleness metrics
For profile staleness report, before it only counts for the top-level function samples in the nested profile, the samples in the inline
[AutoFDO] Use flattened profiles for profile staleness metrics
For profile staleness report, before it only counts for the top-level function samples in the nested profile, the samples in the inlinees are ignored. This could affect the quality of the metrics when there are heavily inlined functions. This change adds a feature to flatten the nested profile and we're changing to use flatten profile as the input for stale profile detection and matching. Example for profile flattening:
``` Original profile: _Z3bazi:20301:1000 1: 1000 3: 2000 5: inline1:1600 1: 600 3: inline2:500 1: 500
Flattened profile: _Z3bazi:18701:1000 1: 1000 3: 2000 5: 600 inline1:600 inline1:1100:600 1: 600 3: 500 inline2: 500 inline2:500:500 1: 500 ``` This feature could be useful for offline analysis, like understanding the hotness of each individual function. So I'm adding the support to `llvm-profdata merge` under `--gen-flattened-profile`.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D146452
show more ...
|
Revision tags: llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7 |
|
#
5d7950a4 |
| 16-Dec-2022 |
Hongtao Yu <hoy@fb.com> |
[CSSPGO][llvm-profgen] Missing frame inference.
This change introduces a missing frame inferrer aiming at fixing missing frames. It current only handles missing frames due to the compiler tail call
[CSSPGO][llvm-profgen] Missing frame inference.
This change introduces a missing frame inferrer aiming at fixing missing frames. It current only handles missing frames due to the compiler tail call elimination (TCE) but could also be extended to supporting other scenarios like frame pointer omission. When a tail called function is sampled, the caller frame will be missing from the call chain because the caller frame is reused for the callee frame. While TCE is beneficial to both perf and reducing stack overflow, a workaround being made in this change aims to find back the missing frames as much as possible.
The idea behind this work is to build a dynamic call graph that consists of only tail call edges constructed from LBR samples and DFS-search for a unique path for a given source frame and target frame on the graph. The unique path will be used to fill in the missing frames between the source and target. Note that only a unique path counts. Multiple paths are treated unreachable since we don't want to overcount for any particular possible path.
A switch --infer-missing-frame is introduced and defaults to be on.
Some testing results: - 0.4% perf win according to three internal benchmarks. - About 2/3 of the missing tail call frames can be recovered, according to an internal benchmark. - 10% more profile generation time.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D139367
show more ...
|
Revision tags: llvmorg-15.0.6 |
|
#
c2250d8b |
| 24-Nov-2022 |
Fangrui Song <i@maskray.me> |
[CSSPGO] Move cl::opt inside llvm:: after D100528 and D108342
|
Revision tags: llvmorg-15.0.5, llvmorg-15.0.4 |
|
#
c5667778 |
| 01-Nov-2022 |
Hongtao Yu <hoy@fb.com> |
[llvm-profgen] Fix a typo in collectProfiledFunctions
As titled. The change should have minimal impact since the targets of branch samples are mostly covered by range samples.
Reviewed By: wenlei,
[llvm-profgen] Fix a typo in collectProfiledFunctions
As titled. The change should have minimal impact since the targets of branch samples are mostly covered by range samples.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D137203
show more ...
|
Revision tags: llvmorg-15.0.3 |
|
#
d5a963ab |
| 17-Oct-2022 |
Hongtao Yu <hoy@fb.com> |
[PseudoProbe] Replace relocation with offset for entry probe.
Currently pseudo probe encoding for a function is like: - For the first probe, a relocation from it to its physical position in the cod
[PseudoProbe] Replace relocation with offset for entry probe.
Currently pseudo probe encoding for a function is like: - For the first probe, a relocation from it to its physical position in the code body - For subsequent probes, an incremental offset from the current probe to the previous probe
The relocation could potentially cause relocation overflow during link time. I'm now replacing it with an offset from the first probe to the function start address.
A source function could be lowered into multiple binary functions due to outlining (e.g, coro-split). Since those binary function have independent link-time layout, to really avoid relocations from .pseudo_probe sections to .text sections, the offset to replace with should really be the offset from the probe's enclosing binary function, rather than from the entry of the source function. This requires some changes to previous section-based emission scheme which now switches to be function-based. The assembly form of pseudo probe directive is also changed correspondingly, i.e, reflecting the binary function name.
Most of the source functions end up with only one binary function. For those don't, a sentinel probe is emitted for each of the binary functions with a different name from the source. The sentinel probe indicates the binary function name to differentiate subsequent probes from the ones from a different binary function. For examples, given source function
``` Foo() { … Probe 1 … Probe 2 } ```
If it is transformed into two binary functions:
``` Foo: …
Foo.outlined: … ```
The encoding for the two binary functions will be separate:
```
GUID of Foo Probe 1
GUID of Foo Sentinel probe of Foo.outlined Probe 2 ```
Then probe1 will be decoded against binary `Foo`'s address, and Probe 2 will be decoded against `Foo.outlined`. The sentinel probe of `Foo.outlined` makes sure there's not accidental relocation from `Foo.outlined`'s probes to `Foo`'s entry address.
On the BOLT side, to be minimal intrusive, the pseudo probe re-encoding sticks with the old encoding format. This is fine since unlike linker, Bolt processes the pseudo probe section as a whole and it is free from relocation overflow issues.
The change is downwards compatible as long as there's no mixed use of the old encoding and the new encoding.
Reviewed By: wenlei, maksfb
Differential Revision: https://reviews.llvm.org/D135912 Differential Revision: https://reviews.llvm.org/D135914 Differential Revision: https://reviews.llvm.org/D136394
show more ...
|
#
91cc53d5 |
| 26-Oct-2022 |
wlei <wlei@fb.com> |
[llvm-profgen] Do not cache the frame location stack during computing inlined context size
In `computeInlinedContextSizeForRange`, the offset of range is only used one time, there is no need to cach
[llvm-profgen] Do not cache the frame location stack during computing inlined context size
In `computeInlinedContextSizeForRange`, the offset of range is only used one time, there is no need to cache the frame location stack. Measured on one internal service binary, this can save 2GB memory usage and reduce a small run time (avoid one hash search).
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D128859
show more ...
|