Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init
# 9423e459 | 24-Dec-2023 | Benjamin Kramer <benny.kra@googlemail.com>
[ProfileData] Copy CallTargetMaps a bit less. NFCI
Revision tags: llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4
# ef0e0adc | 17-Oct-2023 | William Junda Huang <williamjhuang@google.com>
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In the previous implementation, when an MD5 Sample Profile is read, the
reader first converts the MD5 values to strings and then creates a
StringRef as if the numerical strings were regular function names; later on,
IPO transformation passes perform string comparisons over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but can represent a hash value directly without any
conversion, and it is more efficient (I will attach some benchmark
results later) when used in associative containers.
ProfileFuncRef guarantees that the same function name in string form or in
MD5 form has the same hash value, which also fixes a few issues in IPO
passes where function matching/lookup only checks the function name
string and returns a no-match if the profile is MD5.
When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec on average, and reading the function offset table drops from 0.78s to 0.7s.
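As a rough illustration of the idea (hypothetical names and a placeholder hash, not the actual llvm class): the key property is that a reference built from a name and one built from that name's MD5 value hash and compare equal, so associative-container lookups work for both profile encodings.

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical sketch of a ProfileFuncRef-like type: it holds either a
// function-name string or its hash, and guarantees both forms hash the same
// so lookups succeed regardless of how the profile was encoded.
struct FuncRef {
  const std::string *Name = nullptr; // non-owning; null when only a hash is known
  uint64_t Hash = 0;

  // Stand-in for MD5Hash(Name); the real code would use llvm::MD5.
  static uint64_t hashName(const std::string &N) {
    uint64_t H = 1469598103934665603ull; // FNV-1a as a placeholder
    for (unsigned char C : N) { H ^= C; H *= 1099511628211ull; }
    return H;
  }

  explicit FuncRef(const std::string &N) : Name(&N), Hash(hashName(N)) {}
  explicit FuncRef(uint64_t MD5) : Hash(MD5) {}

  // Equality is defined on the hash, so a string-form ref and an MD5-form
  // ref for the same function compare equal.
  bool operator==(const FuncRef &O) const { return Hash == O.Hash; }
};

struct FuncRefHasher {
  size_t operator()(const FuncRef &F) const { return static_cast<size_t>(F.Hash); }
};

int main() {
  std::string Foo = "foo";
  std::unordered_map<FuncRef, int, FuncRefHasher> Samples;
  Samples[FuncRef(Foo)] = 42;                                           // inserted by name
  std::cout << Samples.count(FuncRef(FuncRef::hashName(Foo))) << "\n";  // found by hash: prints 1
}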
Revision tags: llvmorg-17.0.3, llvmorg-17.0.2
# e7247f10 | 30-Sep-2023 | Tom Stellard <tstellar@redhat.com>
[profiling] Move option declarations into headers
This will make it possible to add visibility attributes to these variables. This also fixes some type mismatches between the declaration and the definition.
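A minimal sketch of the pattern (file and option names here are illustrative, not the actual files touched by the change): the header carries only an extern declaration, so a visibility attribute can later be attached in one place, while a single .cpp keeps the definition with a type that is guaranteed to match.

// SampleProfileOptions.h (hypothetical header): declaration only.
#include "llvm/Support/CommandLine.h"
extern llvm::cl::opt<int> SampleProfileInlineLimitMax;

// SampleProfileOptions.cpp (hypothetical): the single definition.
llvm::cl::opt<int> SampleProfileInlineLimitMax(
    "sample-profile-inline-limit-max", llvm::cl::init(3000),
    llvm::cl::desc("Max size limit for a callee to be inlined"));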
Reviewed By: bogner, huangjd
Differential Revision: https://reviews.llvm.org/D156599
Revision tags: llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4
# 111fcb0d | 02-Sep-2023 | Fangrui Song <i@maskray.me>
[llvm] Fix duplicate word typos. NFC
Those fixes were taken from https://reviews.llvm.org/D137338
Revision tags: llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init
# 5cbcaf16 | 26-Jun-2023 | Hongtao Yu <hoy@fb.com>
[CSSPGO][Preinliner] Bump up the threshold to favor previous compiler inline decision.
The compiler has more insight and knowledge about functions based on their IR and attributes and should make a better inline decision than the offline preinliner, which is purely based on callsite hotness and code size. Therefore I'm making changes to favor the previous compiler inline decision by bumping up the callsite allowance.
This should improve the performance by more than 1% according to testing on Meta services.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D153797
Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1
# 5c2ae37b | 27-Mar-2023 | Hongtao Yu <hoy@fb.com>
[CSSPGO][Preinliner] Trim cold call edges of the profiled call graph for a more stable profile generation.
I've noticed that for some services the CSSPGO profile is less stable than the non-CS AutoFDO profile from profiling run to profiling run without source changes. This shows up when comparing profile similarities. For example, in my experiments, AutoFDO profiles are always 99+% similar over the same binary but different inputs (very close dynamic traffic), while CSSPGO profile similarity is around 90%.
The main source of the profile instability is the top-down order computed on the profiled call graph in the llvm-profgen CS preinliner. The top-down order is used to guide the CS preinliner to pre-compute an inline decision that is later on fulfilled by the compiler. A subtle change in the top-down order from run to run could cause a different inline decision to be computed. A deeper look into the divergence of the top-down order revealed that:
- The topological sorting inside one SCC isn't quite right. This is fixed by {D130717}.
- The profiled call graphs of the two sides of the A/B run aren't 100% the same. The call edges in the two runs do not subsume each other, and edges that appear in both graphs may not have exactly the same weight. This is due to the nature that the graphs are dynamic. However, I saw that the graphs can be made closer by removing the cold edges from them, and this bumped up the CSSPGO profile stability to the same level as the AutoFDO profile.
Removing cold call edges from the dynamic call graph may have an impact on cold inlining, but so far I haven't seen any performance issues since the CS preinliner mainly targets hot callsites, and cold inlining can always be done by the compiler CGSCC inliner.
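A minimal sketch of the trimming step described above (hypothetical types and threshold handling, not the llvm-profgen implementation): edges whose accumulated weight falls below a cold cutoff are dropped before the top-down order is computed.

#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical profiled call graph: edge (caller, callee) -> accumulated weight.
using Edge = std::pair<std::string, std::string>;
using ProfiledCallGraph = std::map<Edge, uint64_t>;

// Remove edges colder than Threshold so that run-to-run noise in cold edges
// cannot perturb the top-down processing order.
void trimColdEdges(ProfiledCallGraph &G, uint64_t Threshold) {
  for (auto It = G.begin(); It != G.end();) {
    if (It->second < Threshold)
      It = G.erase(It);  // erase returns the next valid iterator
    else
      ++It;
  }
}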
Also fixed an issue where the largest weight, instead of the accumulated weight, of a call edge was used in the profiled call graph.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D147013
Revision tags: llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3
# 1e692113 | 14-Feb-2023 | Fangrui Song <i@maskray.me>
Move global namespace cl::opt inside llvm::
# 39eb1c61 | 10-Feb-2023 | Hongtao Yu <hoy@fb.com>
[CSSPGO][Preinliner] Set default value of sample-profile-inline-limit-max to 50000.
The previous threshold 3000 is too small to enable any inlining for giant functions that come in with a bigger size than that. In the real world, I've seen a big hot function with a disassembly size of 34000. Motivated by that, I'm changing the value to 50000.
With the new value the allowed size growth should still be reasonable, as it is also bounded by another threshold, i.e., --sample-profile-inline-growth-limit, which defaults to 12. The new value should mostly only affect giant functions.
I've seen for several internal services that the new threshold boosts performance, and it has a neutral impact for other services without hot giant functions. So far I haven't seen any performance regression with it.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D143696
Revision tags: llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init
# 7b81a81d | 21-Jul-2022 | Mircea Trofin <mtrofin@google.com>
[NFC] FunctionSamples::getEntrySamples -> getHeadSamplesEstimate
The name `getEntrySamples` was misleading for two reasons: one, it's close in name to `Function::getEntryCount`, but the equivalent here is `getHeadSamples`; second, as opposed to the other get* APIs in `FunctionSamples`, it performs an estimate/heuristic rather than just retrieving raw data (or a non-heuristic derivative of that data, like `getMaxCountInside`).
The new name should more clearly communicate its intent; and, being close (in name) to `getHeadSamples`, it should allow the reader to discover the relation between them.
Also updated the doc comments for both `getHeadSamples[Estimate]` so a reader may better understand the relation between them.
Differential Revision: https://reviews.llvm.org/D130281
# 7e86b13c | 28-Jun-2022 | wlei <wlei@fb.com>
[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie
This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Previously the promotion and merging of contexts was based on the SampleContext (the array of frames), which costs a lot of memory. This patch detaches the tracker from using the array ref and instead uses the context trie itself. This can save a lot of memory and benefits both the compiler's CS inliner and llvm-profgen's pre-inliner.
One structure that needs special treatment is `FuncToCtxtProfiles`, which is used to get all the FunctionSamples for one function to do the merging and promoting. Previously it searched each function's context and traversed the trie to get the node of the context. Now that we don't have the context inside the profile, we instead use an auxiliary map `ProfileToNodeMap`, which is initialized with the FunctionSamples-to-TrieNode relations and kept updated while promoting and merging nodes.
Moreover, I was expecting the results before and after to remain the same, but I found that the order of FuncToCtxtProfiles matters and affects the results. This can happen in recursive context cases, but the difference should be small. Since we no longer have the context, I used a vector to fix the order, so the result is still deterministic.
Measured on one huge (12GB) profile from one of our internal services: the profile similarity is 99.999%, the running time is improved by 3X (debug mode), and the memory usage is reduced from 170GB to 90GB.
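An illustrative sketch of the trie-based representation (hypothetical names; the real tracker lives in SampleContextTracker): children are keyed by callsite location plus callee name, and an auxiliary map ties each FunctionSamples object to its node so promotion and merging never need to rebuild a context array.

#include <cstdint>
#include <map>
#include <string>
#include <utility>

struct FunctionSamples;  // stand-in for llvm::sampleprof::FunctionSamples

// A context trie node: caller at the root side, inlined callees as children.
struct ContextTrieNode {
  std::string FuncName;
  FunctionSamples *Samples = nullptr;
  // Children keyed by (callsite offset, callee name).
  std::map<std::pair<uint64_t, std::string>, ContextTrieNode> Children;

  ContextTrieNode &getOrCreateChild(uint64_t CallsiteOffset,
                                    const std::string &Callee) {
    ContextTrieNode &Child = Children[{CallsiteOffset, Callee}];
    Child.FuncName = Callee;
    return Child;
  }
};

// Auxiliary map so promotion/merging can go from a profile straight to its
// trie node instead of re-deriving the full context from strings.
using ProfileToNodeMap = std::map<const FunctionSamples *, ContextTrieNode *>;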
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D127031
Revision tags: llvmorg-14.0.6, llvmorg-14.0.5
# 7d293744 | 08-Jun-2022 | Hongtao Yu <hoy@fb.com>
[CSSPGO][Preinliner] Set default value of sample-profile-inline-limit-max to 3000
The default value of sample-profile-inline-limit-max is defined as 10000 in sampleprofile.cpp. This is too big for the CS preinliner, which works with assembly size instead of IR size. The value 3000 turns out to be a good tradeoff: compared to the value 10000, 3000 gives equally good performance and code size, but a lower build time.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D127330
Revision tags: llvmorg-14.0.4
# a4190037 | 09-May-2022 | Hongtao Yu <hoy@fb.com>
[CSSPGO][Preinliner] Use linear threshold to drive inline decision.
The per-callsite size threshold used today to drive the preinline decision is based on hotness/coldness cutoffs. The default setup is: for callsites with a sample count above the hotness cutoff (99%), a size threshold of 1500 is used; any callsite below the 99.99% coldness cutoff uses a zero threshold. This has a couple of issues:
1. While both cutoffs and size thresholds are configurable, different applications may need different setups, making a universal setup impractical.
2. The callsites between the hotness cutoff and the coldness cutoff are not considered as inline candidates, which could be a missed opportunity.
3. Hot callsites always use the same threshold. In reality we may want a bigger threshold for hotter callsites.
In this change we are introducing a linear threshold regardless of hot/cold cutoffs. Given a sample space, a threshold is computed for a callsite based on the position of that callsite's sample in the whole space. With that we no longer need to define what's hot or cold. Callsites with different hotness will get a different threshold. This should overcome the above three issues.
I have seen good results with a universal default setup for two of our internal services.
For one service, 0.2% to 0.5% perf improvement over a baseline with a previous default setup, on-par code size. For the second service, 0.5% to 0.8% perf improvement over a baseline with a previous default setup, 0.2% code size increase; on-par performance and code size with a baseline that is with a carefully tuned cutoff to cover enough hot functions.
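A rough sketch of the linear threshold described above (the rank-based interpolation, parameter names, and default value are assumptions for illustration, not the actual llvm-profgen code): the threshold scales from roughly zero for the coldest callsite up to a maximum for the hottest, based on where the callsite's count sits in the sorted sample space.

#include <algorithm>
#include <cstdint>
#include <vector>

// Given all callsite sample counts (the "sample space") and one callsite's
// count, compute a size threshold proportional to the callsite's rank.
unsigned linearThreshold(std::vector<uint64_t> Counts, uint64_t CallsiteCount,
                         unsigned MaxThreshold = 1500) {
  if (Counts.empty())
    return 0;
  std::sort(Counts.begin(), Counts.end());
  // Position of this callsite's count within the sorted space, in [0, 1].
  auto It = std::lower_bound(Counts.begin(), Counts.end(), CallsiteCount);
  double Rank = double(It - Counts.begin()) / double(Counts.size());
  return static_cast<unsigned>(Rank * MaxThreshold);
}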
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D125023
Revision tags: llvmorg-14.0.3
# e36786d1 | 28-Apr-2022 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Rename ProfileIsCSNested and ProfileIsCSFlat
To be more clear and definitive, I'm renaming `ProfileIsCSFlat` back to `ProfileIsCS` which stands for full context-sensitive flat profiles. `ProfileIsCSNested` is now renamed to `ProfileIsPreInlined` and is extended to be applicable for CS flat profiles too. More specifically, `ProfileIsPreInlined` is for any kind of profiles (flat or nested) that contain 'ShouldBeInlined' contexts. The flag is encoded in the profile summary section for extbinary profiles and is computed on-the-fly for text profiles.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D122602
Revision tags: llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2
# db29f437 | 23-Feb-2022 | serge-sans-paille <sguelton@redhat.com>
Cleanup include: DebugInfo/Symbolize
Estimation of the impact on preprocessor output, before: 1067487786, after: 1067349756
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120433
Revision tags: llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2
# f6f0409f | 15-Dec-2021 | Wenlei He <wenlei@fb.com>
[llvm-profgen] Turn on preinliner by default
The preinliner has been tuned on large server workloads and is now ready to be turned on by default. This change also updates the thresholds based on tuning.
Differential Revision: https://reviews.llvm.org/D115770
# bf317f66 | 29-Nov-2021 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Sorting nodes in a cycle of profiled call graph.
For nodes that are in a cycle of a profiled call graph, the current order the underlying scc_iter computes purely depends on how those nodes are reached from outside the SCC and inside the SCC, based on the Tarjan algorithm. This does not honor profile edge hotness, thus does not guarantee hot callsites are inlined prior to cold callsites. To mitigate that, I'm adding an extra sorter on top of scc_iter to sort SCC functions in the order of callsite hotness, instead of changing the internals of scc_iter.
Sorting on callsite hotness could optimally be based on detecting cycles in a directed call graph, i.e., removing the coldest edge until a cycle is broken. However, detecting cycles isn't cheap. I'm using an MST-based approach which is faster and appears to deliver some performance wins.
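As a simplified illustration of the extra sorter (hypothetical types; the actual change orders nodes via an MST-based approach rather than this direct sort), the functions inside one SCC can be reordered by the accumulated hotness of their incoming profiled call edges so hotter callees are handled first:

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Reorder the functions of one SCC by incoming profiled callsite hotness so
// hotter callees come earlier in the top-down order.
std::vector<std::string>
sortSCCByHotness(std::vector<std::string> SCC,
                 const std::map<std::pair<std::string, std::string>, uint64_t>
                     &EdgeWeights) {
  std::map<std::string, uint64_t> Hotness;
  for (const auto &Entry : EdgeWeights)
    Hotness[Entry.first.second] += Entry.second;  // accumulate incoming weight per callee
  auto Hot = [&](const std::string &F) {
    auto It = Hotness.find(F);
    return It == Hotness.end() ? uint64_t(0) : It->second;
  };
  std::stable_sort(SCC.begin(), SCC.end(),
                   [&](const std::string &A, const std::string &B) {
                     return Hot(A) > Hot(B);  // hotter first
                   });
  return SCC;
}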
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D114204
Revision tags: llvmorg-13.0.1-rc1
# 259e4c56 | 27-Oct-2021 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Trim cold base profiles for the CS preinliner.
Adding support to the CS preinliner to trim cold base profiles. This makes trimming consistent with the inline decision made by the preinliner. Also disable the existing profile merger when the preinliner is on, unless explicitly specified.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D112489
Revision tags: llvmorg-13.0.0, llvmorg-13.0.0-rc4
# 446e2162 | 16-Sep-2021 | Wenlei He <aktoon@gmail.com>
[llvm-profgen] Use context-sensitive byte size cost for preinliner decisions by default
Turn on `use-context-cost-for-preinliner` to use context-sensitive byte size cost for preinliner decisions by default.
This is a more accurate proxy of inline cost than profile size. We tested on our large workload that it delivers a measurable CPU improvement.
Differential Revision: https://reviews.llvm.org/D109893
Revision tags: llvmorg-13.0.0-rc3
# f10004e7 | 01-Sep-2021 | Wenlei He <aktoon@gmail.com>
[CSSPGO] Add stats for pre-inliner
Add some stats to help with tuning the pre-inliner.
Differential Revision: https://reviews.llvm.org/D109098
# 7ca80300 | 30-Aug-2021 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Enable loading MD5 CS profile.
Adding the compiler support for MD5 CS profiles based on the previous context split work D107299. An MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster.
There are a few conversions from real names to MD5 names that have been made on the sample loader and context tracker side to get it to work.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D108342
Revision tags: llvmorg-13.0.0-rc2
# b9db7036 | 25-Aug-2021 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Split context string to deduplicate function name used in the context.
Currently context strings contain a lot of duplicated function names, and that significantly increases the profile size. This change splits the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by an index into the name table, which significantly reduces the size consumed by the context.
A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings, which is time- and memory-consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map, which was keyed by `StringRef`, is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to use an `ArrayRef` to represent a full context for a CS profile. For a non-CS profile, it falls back to using `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For the compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing.
When it comes to profile format, nothing has changed for the text format, though internally a CS context is implemented as a vector. The extbinary format is only changed for CS profiles, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, with each element being an offset pointing into `SecNameTable`. All occurrences of contexts elsewhere are redirected to use offsets into `SecCSNameTable`.
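A rough sketch of the encoding (hypothetical struct and field names, not the actual on-disk layout): each context frame stores an index into the shared name table plus the callsite location, and the CS name-table section logically holds one such sequence per context.

#include <cstdint>
#include <string>
#include <vector>

// One frame of a context: an index into the name table instead of a copy of
// the function-name string, plus the callsite location within the caller.
struct ContextFrame {
  uint32_t NameIndex;      // offset into the name table (SecNameTable)
  uint32_t LineOffset;     // callsite line offset from function start
  uint32_t Discriminator;  // DWARF-style discriminator
};

// Logical layout: one shared name table, and each CS context stored as a
// list of frames that reference it.
struct EncodedProfile {
  std::vector<std::string> NameTable;                   // SecNameTable
  std::vector<std::vector<ContextFrame>> CSNameTable;   // SecCSNameTable
};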
Testing: this is a no-diff change in terms of code quality and profile content (for text profiles).
For our internal large service (aka ads), the profile generation time is cut in half, with a 20x smaller string-based extbinary profile generated.
The compile time of ads drops by 25%.
Differential Revision: https://reviews.llvm.org/D107299
# eca03d27 | 17-Aug-2021 | Wenlei He <aktoon@gmail.com>
[CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen
This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions.
To do this, BinarySizeContextTracker is introduced to track function byte size under different inline contexts during disassembling. In the preinliner, we can now query context byte size under the switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep the size of functions under different contexts (callee as parent, caller as child), and it can give the best/longest possible matching context size for a given input context.
The new size cost is off by default. There are a few TODOs that need to be addressed: 1) avoid dangling strings from `Offset2LocStackMap`, which will be addressed in the split-context work; 2) use the inlinee's entry probe to make sure we have a correct zero size for an inlinee that's completely optimized away after inlining. Some tuning is also needed.
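A minimal sketch of the reverse-trie lookup (hypothetical names, not the actual BinarySizeContextTracker): the callee sits at the root and callers hang underneath, so probing with a full context and walking down yields the size recorded for the longest matching context.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Reverse context trie: callee as parent, caller as child. A node may record
// the byte size observed for that (partial) context.
struct SizeTrieNode {
  std::map<std::string, SizeTrieNode> Callers;  // child keyed by caller name
  uint64_t Size = 0;
  bool HasSize = false;
};

// Query with Context = {callee, caller, caller's caller, ...}. Walk as deep as
// the trie allows and return the size of the longest matching prefix.
uint64_t queryContextSize(const SizeTrieNode &Root,
                          const std::vector<std::string> &Context) {
  const SizeTrieNode *Node = &Root;
  uint64_t Best = Root.HasSize ? Root.Size : 0;
  for (size_t I = 1; I < Context.size(); ++I) {  // Context[0] is the callee (the root)
    auto It = Node->Callers.find(Context[I]);
    if (It == Node->Callers.end())
      break;
    Node = &It->second;
    if (Node->HasSize)
      Best = Node->Size;
  }
  return Best;
}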
Differential Revision: https://reviews.llvm.org/D108180
Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3
# c60f1d5d | 16-Jun-2021 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Fix an invalid hash table reference issue in the CS preinliner.
We were using a `StringMap` object to store all profiles to be emitted. The object is basically an unordered hash table, therefore updating it in the process of traversing it may cause issues since the underlying bucket array could change.
I'm also moving the `csspgo-preinliner` switch around so that no context trie will be constructed (by the constructor of `CSPreInliner`) when the switch is off.
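A small sketch of the hazard and the usual fix (std::unordered_map stands in for llvm::StringMap here, and the merge step is made up for illustration): mutate the table only after the traversal has collected the keys.

#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the StringMap of profiles mentioned in the message.
using ProfileMap = std::unordered_map<std::string, int>;

// Unsafe pattern: inserting into the map while iterating it may rehash the
// table and invalidate the range-for iterator.
//   for (auto &Entry : Profiles)
//     Profiles["merged." + Entry.first] = Entry.second;  // may invalidate
//
// Safer pattern: snapshot the keys first, then update the map.
void mergeSafely(ProfileMap &Profiles) {
  std::vector<std::string> Keys;
  Keys.reserve(Profiles.size());
  for (const auto &Entry : Profiles)
    Keys.push_back(Entry.first);
  for (const auto &K : Keys)
    Profiles["merged." + K] += Profiles[K];  // mutation happens after traversal
}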
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104267
Revision tags: llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1
# 00ef28ef | 08-Apr-2021 | Wenlei He <aktoon@gmail.com>
[CSSPGO] Fix dangling context strings and improve profile order consistency and error handling
This patch fixed the following issues alongside some refactoring:
1. Fix bugs where the StringRef for a context string outlives the underlying std::string. We now keep a string table in the profile generator to hold the std::strings. We also do the same for bracketed context strings in the profile writer.
2. Make sure profile output strictly follows (total sample, name) order. Previously, there was inconsistency between ProfileMap's key and FunctionSamples's name, leading to inconsistent ordering. This is now fixed by introducing context profile canonicalization. Assertions are also added to make sure ProfileMap's key and FunctionSamples's name are always consistent.
3. Enhanced error handling for profile writing to make sure we bubble up errors properly for both llvm-profgen and llvm-profdata when the string table is not populated correctly for the extended binary profile.
4. Keep all internal context representations bracket-free. This avoids creating new strings for context trimming, merging and preinlining. The getNameWithContext API is now simplified accordingly.
5. Factor out the code for context trimming and merging into SampleContextTrimmer in SampleProf.cpp. This enables llvm-profdata to use the trimmer when merging profiles. Changes in llvm-profgen will be in a separate patch.
Differential Revision: https://reviews.llvm.org/D100090
Revision tags: llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4
# 3e3fc431 | 29-Mar-2021 | Hongtao Yu <hoy@fb.com>
[CSSPGO] Top-down processing order based on full profile.
Use profiled call edges to augment the top-down order. There are cases where the top-down order computed based on the static call graph doesn't reflect the real execution order. For example:
1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets by allowing the caller to be processed before them.
2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus may cause potential inlining to be overlooked. The function order in one SCC is being adjusted to a top-down order based on the profile to favor more inlining.
3. Transitive indirect call edges due to inlining. When a callee function is inlined into a caller function in LTO prelink, every call edge originating from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink since the inlined callee is gone from the static call graph.
4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless based on a local file scope. More importantly, the inlinee, once fully inlined into a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function, which may be called from external modules. While this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the surviving copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.
Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case #4.
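A condensed sketch of the idea (hypothetical, string-keyed graph; the real pass builds a dedicated profiled call graph structure): derive a top-down order purely from profiled edges via a DFS reverse post-order, independent of the static call graph.

#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Build a caller -> callees adjacency from profiled edges only, then emit a
// top-down order by reversing a DFS post-order (cycles are simply cut at
// already-visited nodes in this simplified sketch).
std::vector<std::string> topDownOrder(
    const std::map<std::pair<std::string, std::string>, uint64_t> &ProfiledEdges) {
  std::map<std::string, std::set<std::string>> Callees;
  std::set<std::string> Nodes;
  for (const auto &E : ProfiledEdges) {
    Callees[E.first.first].insert(E.first.second);
    Nodes.insert(E.first.first);
    Nodes.insert(E.first.second);
  }
  std::vector<std::string> PostOrder;
  std::set<std::string> Visited;
  for (const auto &Root : Nodes) {
    if (Visited.count(Root))
      continue;
    // Iterative DFS; the bool marks whether a node's children were expanded.
    std::vector<std::pair<std::string, bool>> Stack{{Root, false}};
    while (!Stack.empty()) {
      auto [Func, Expanded] = Stack.back();
      Stack.pop_back();
      if (Expanded) {
        PostOrder.push_back(Func);
        continue;
      }
      if (!Visited.insert(Func).second)
        continue;
      Stack.push_back({Func, true});
      for (const auto &Callee : Callees[Func])
        if (!Visited.count(Callee))
          Stack.push_back({Callee, false});
    }
  }
  // Callers should be processed before callees: reverse the post-order.
  return {PostOrder.rbegin(), PostOrder.rend()};
}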
I'm seeing an average 0.4% perf win out of SPEC2017. For certain benchmarks such as Xalancbmk and GCC, the win is bigger, above 2%.
The change is an enhancement to https://reviews.llvm.org/D95988.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D99351