History log of /llvm-project/llvm/tools/llvm-profgen/CSPreInliner.cpp (Results 1 – 25 of 26)
Revision Date Author Comments
Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init
# 9423e459 24-Dec-2023 Benjamin Kramer <benny.kra@googlemail.com>

[ProfileData] Copy CallTargetMaps a bit less. NFCI


Revision tags: llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4
# ef0e0adc 17-Oct-2023 William Junda Huang <williamjhuang@google.com>

[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)

This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740

In the previous implementation, when an MD5 Sample Profile is read, the
reader first converts the MD5 values to strings, then creates a
StringRef as if the numerical strings were regular function names, and
later on IPO transformation passes perform string comparisons over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but can represent a hash value directly without any
conversion, which makes it more efficient (I will attach some benchmark
results later) when used in associative containers.

ProfileFuncRef guarantees that the same function name in string form or in
MD5 form has the same hash value, which also fixes a few issues in IPO
passes where function matching/lookup only checked the function name
string and returned a no-match if the profile was MD5.

When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec on average, and reading the function offset table drops from 0.78s to 0.7s.
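
As a rough illustration of the approach (hypothetical type and names, not the actual `ProfileFuncRef` added by this patch), a reference type that hashes identically whether it holds a string or a precomputed hash could look like this:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical sketch of the idea (not the actual ProfileFuncRef): a
// function-name reference that can hold either a real name or a precomputed
// hash, and hashes/compares identically in both forms so lookups in
// associative containers need no string conversion.
class FuncNameRef {
  const std::string *Name = nullptr; // set only for string-based profiles
  uint64_t Hash = 0;                 // always valid

  static uint64_t hashOf(const std::string &S) {
    // Placeholder FNV-1a; the real code would use the MD5 of the name.
    uint64_t H = 1469598103934665603ull;
    for (unsigned char C : S) {
      H ^= C;
      H *= 1099511628211ull;
    }
    return H;
  }

public:
  explicit FuncNameRef(const std::string &S) : Name(&S), Hash(hashOf(S)) {}
  explicit FuncNameRef(uint64_t HashValue) : Hash(HashValue) {}

  bool isStringForm() const { return Name != nullptr; }
  uint64_t hash() const { return Hash; }
  bool operator==(const FuncNameRef &O) const { return Hash == O.Hash; }
};

struct FuncNameRefHasher {
  size_t operator()(const FuncNameRef &R) const { return size_t(R.hash()); }
};

int main() {
  std::string Foo = "_Z3foov";
  std::unordered_set<FuncNameRef, FuncNameRefHasher> Profiles;
  Profiles.insert(FuncNameRef(Foo)); // inserted from a string-based profile
  // A lookup by the hash form finds the same entry without building a string.
  return Profiles.count(FuncNameRef(FuncNameRef(Foo).hash())) ? 0 : 1;
}
```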


Revision tags: llvmorg-17.0.3, llvmorg-17.0.2
# e7247f10 30-Sep-2023 Tom Stellard <tstellar@redhat.com>

[profiling] Move option declarations into headers

This will make it possible to add visibility attributes to these
variables. This also fixes some type mismatches between the
declaration and the definition.

Reviewed By: bogner, huangjd

Differential Revision: https://reviews.llvm.org/D156599


Revision tags: llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4
# 111fcb0d 02-Sep-2023 Fangrui Song <i@maskray.me>

[llvm] Fix duplicate word typos. NFC

Those fixes were taken from https://reviews.llvm.org/D137338


Revision tags: llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init
# 5cbcaf16 26-Jun-2023 Hongtao Yu <hoy@fb.com>

[CSSPGO][Preinliner] Bump up the threshold to favor previous compiler inline decision.

The compiler has more insight and knowledge about functions based on their IR and attributes and should make a better inline decision than the offline preinliner, which is purely based on callsite hotness and code size. Therefore I'm making changes to favor the previous compiler inline decision by bumping up the callsite allowance.

This should improve the performance by more than 1% according to testing on Meta services.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D153797


Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1
# 5c2ae37b 27-Mar-2023 Hongtao Yu <hoy@fb.com>

[CSSPGO][Preinliner] Trim cold call edges of the profiled call graph for a more stable profile generation.

I've noticed that for some services the CSSPGO profile is less stable than the non-CS AutoFDO profile from profiling run to profiling run without source changes. This is manifested by comparing profile similarities. For example, in my experiments, AutoFDO profiles are always 99+% similar over the same binary but different inputs (very similar dynamic traffic), while CSSPGO profile similarity is around 90%.

The main source of the profile instability is the top-down order computed on the profiled call graph in the llvm-profgen CS preinliner. The top-down order is used to guide the CS preinliner to pre-compute an inline decision that is later on fulfilled by the compiler. A subtle change in the top-down order from run to run could cause a different inline decision to be computed. A deeper look into the divergence of the top-down order revealed that:
- The topological sorting inside one SCC isn't quite right. This is fixed by {D130717}.
- The profiled call graphs of the two sides of an A/B run aren't 100% the same. The call edges in the two runs do not subsume each other, and edges that appear in both graphs may not have exactly the same weight. This is due to the dynamic nature of the graphs. However, I saw that the graphs can be made closer by removing the cold edges from them, and this bumped up CSSPGO profile stability to the same level as the AutoFDO profile.

Removing cold call edges from the dynamic call graph may have an impact on cold inlining, but so far I haven't seen any performance issues since the CS preinliner mainly targets hot callsites, and cold inlining can always be done by the compiler CGSCC inliner.
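
Schematically, the trimming amounts to dropping profiled call edges whose weight is below a cold threshold before the top-down order is computed (hypothetical types, not the actual llvm-profgen data structures):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch: a profiled call graph kept as an edge-weight map
// keyed by (caller, callee). Removing edges colder than the threshold keeps
// run-to-run noise in cold edges from perturbing the top-down order.
using ProfiledEdges = std::map<std::pair<std::string, std::string>, uint64_t>;

void trimColdEdges(ProfiledEdges &Edges, uint64_t ColdCountThreshold) {
  for (auto It = Edges.begin(); It != Edges.end();) {
    if (It->second < ColdCountThreshold)
      It = Edges.erase(It); // cold edge: drop it from the graph
    else
      ++It;
  }
}
```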

Also fixing an issue where the largest weight instead of the accumulated weight for a call edge is used in the profiled call graph.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D147013


Revision tags: llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3
# 1e692113 14-Feb-2023 Fangrui Song <i@maskray.me>

Move global namespace cl::opt inside llvm::


# 39eb1c61 10-Feb-2023 Hongtao Yu <hoy@fb.com>

[CSSPGO][Preinliner] Set default value of sample-profile-inline-limit-max to 50000.

The previous threshold 3000 is too small to enable any inlining for giant functions that come in with a bigger size than that. In the real world, I've seen a big hot function with a disassembly size of 34000. Motivated by that, I'm changing the value to 50000.

With the new value the allowed size growth should still be reasonable, as it is also bounded by another threshold, i.e., --sample-profile-inline-growth-limit, which defaults to 12. The new value should mostly only affect giant functions.

For several internal services, the new threshold boosts performance and has a neutral impact for other services without hot giant functions. So far I haven't seen any performance regression with it.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D143696


Revision tags: llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init
# 7b81a81d 21-Jul-2022 Mircea Trofin <mtrofin@google.com>

[NFC] FunctionSamples::getEntrySamples -> getHeadSamplesEstimate

The name `getEntrySamples` was misleading for 2 reasons. One, it's
close in name to `Function::getEntryCount`, but the equivalent here is
`getHeadSamples`; second, as opposed to the other get* APIs in
`FunctionSamples`, it performs an estimate/heuristic rather than just
retrieving raw data (or a non-heuristic derivative of that data, like
`getMaxCountInside`).

The new name should more clearly communicate its intent; and, being
close (in name) to `getHeadSamples`, it should allow the reader to discover
the relation between them.

Also updated the doc comments for both `getHeadSamples[Estimate]` so a
reader may better understand the relation between them.

Differential Revision: https://reviews.llvm.org/D130281


# 7e86b13c 28-Jun-2022 wlei <wlei@fb.com>

[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie

This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Previously, the promotion and merging of contexts was based on the SampleContext (the array of frames), which costs a lot of memory. This patch detaches the tracker from the array ref and instead uses the context trie itself. This can save a lot of memory and benefits both the compiler's CS inliner and llvm-profgen's pre-inliner.

One structure that needs special treatment is `FuncToCtxtProfiles`, which is used to get all the FunctionSamples for one function for merging and promoting. Previously it searched each function's context and traversed the trie to get the node of the context. Now that we don't have the context inside the profile, we directly use an auxiliary map `ProfileToNodeMap`: it is initialized to record the FunctionSamples-to-TrieNode relations and is kept updated while promoting and merging nodes.

Moreover, I was expecting the results before and after to remain the same, but I found that the order of FuncToCtxtProfiles matters and affects the results. This can happen in the recursive context case, but the difference should be small. Now that we don't have the context, I just used a vector to fix the order, so the result is still deterministic.

Measured on one huge (12GB) profile from one of our internal services: the profile similarity is 99.999%, the running time is improved by 3X (debug mode), and the memory is reduced from 170GB to 90GB.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D127031


Revision tags: llvmorg-14.0.6, llvmorg-14.0.5
# 7d293744 08-Jun-2022 Hongtao Yu <hoy@fb.com>

[CSSPGO][Preinliner] Set default value of sample-profile-inline-limit-max to 3000

The default value of sample-profile-inline-limit-max is defined as 10000 in sampleprofile.cpp. This is too big for the CS preinliner, which works with assembly size instead of IR size. The value 3000 turns out to be a good tradeoff: compared to the value 10000, 3000 gives as good performance and code size, but lower build time.

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D127330


Revision tags: llvmorg-14.0.4
# a4190037 09-May-2022 Hongtao Yu <hoy@fb.com>

[CSSPGO][Preinliner] Use linear threshold to drive inline decision.

The per-callsite size threshold used today to drive the preinline decision is based on a hotness/coldness cutoff. In the default setup, callsites with a sample count above the hotness cutoff (99%) use a 1500 size threshold, and any callsite below the 99.99% coldness cutoff uses a zero threshold. This has a couple of issues:

1. While both cutoffs and size thresholds are configurable, different applications may need different setups, making a universal setup impractical.

2. The callsites between hotness cutoff and coldness cutoff are not considered as inline candidates, which could be a missing opportunity.

3. Hot callsites always use the same threshold. In reality we may want a bigger threshold for hotter callsites.

In this change we are introducing a linear threshold regardless of hot/cold cutoffs. Given a sample space, a threshold is computed for a callsite based on the position of that callsite's sample in the whole space. With that we no longer need to define what's hot or cold; callsites with different hotness will get different thresholds. This should overcome the above three issues.
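
A simplified sketch of the idea (illustrative only; the actual formula and flag names from D125023 may differ): the size threshold grows with the callsite's position in the overall sample distribution, so there is no fixed hot/cold boundary.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: derive a per-callsite size threshold that scales
// linearly with how hot the callsite is relative to the whole sample space.
// SortedCounts is the ascending list of all callsite sample counts.
unsigned linearSizeThreshold(uint64_t CallsiteCount,
                             const std::vector<uint64_t> &SortedCounts,
                             unsigned MaxSizeThreshold) {
  if (SortedCounts.empty())
    return 0;
  // Fraction of callsites whose count is at or below this callsite's count.
  auto Pos = std::upper_bound(SortedCounts.begin(), SortedCounts.end(),
                              CallsiteCount) -
             SortedCounts.begin();
  double Percentile = double(Pos) / double(SortedCounts.size());
  // Coldest callsites get a threshold near 0, hottest get MaxSizeThreshold.
  return unsigned(Percentile * MaxSizeThreshold);
}
```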

I have seen good results with a universal default setup for two of our internal services.

For one service, 0.2% to 0.5% perf improvement over a baseline with a previous default setup, on-par code size.
For the second service, 0.5% to 0.8% perf improvement over a baseline with a previous default setup, 0.2% code size increase; on-par performance and code size with a baseline that is with a carefully tuned cutoff to cover enough hot functions.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D125023


Revision tags: llvmorg-14.0.3
# e36786d1 28-Apr-2022 Hongtao Yu <hoy@fb.com>

[CSSPGO] Rename ProfileIsCSNested and ProfileIsCSFlat

To be clearer and more definitive, I'm renaming `ProfileIsCSFlat` back to `ProfileIsCS`, which stands for full context-sensitive flat profiles. `ProfileIsCSNested` is now renamed to `ProfileIsPreInlined` and is extended to be applicable to CS flat profiles too. More specifically, `ProfileIsPreInlined` is for any kind of profile (flat or nested) that contains 'ShouldBeInlined' contexts. The flag is encoded in the profile summary section for extbinary profiles and is computed on-the-fly for text profiles.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D122602


Revision tags: llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2
# db29f437 23-Feb-2022 serge-sans-paille <sguelton@redhat.com>

Cleanup include: DebugInfo/Symbolize

Estimation of the impact on preprocessor output:
after:  1067349756
before: 1067487786

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120433


Revision tags: llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2
# f6f0409f 15-Dec-2021 Wenlei He <wenlei@fb.com>

[llvm-profgen] Turn on preinliner by default

The preinliner has been tuned on large server workloads and it's now ready to be turned on by default. This change also updates the thresholds based on tuning.

Differential Revision: https://reviews.llvm.org/D115770


# bf317f66 29-Nov-2021 Hongtao Yu <hoy@fb.com>

[CSSPGO] Sorting nodes in a cycle of profiled call graph.

For nodes that are in a cycle of a profiled call graph, the current order the underlying scc_iter computes purely depends on how those nodes are reached from outside the SCC and inside the SCC, based on the Tarjan algorithm. This does not honor profile edge hotness, and thus does not guarantee that hot callsites are inlined prior to cold callsites. To mitigate that, I'm adding an extra sorter on top of scc_iter to sort SCC functions in the order of callsite hotness, instead of changing the internals of scc_iter.

Sorting on callsite hotness could optimally be based on detecting cycles on a directed call graph, i.e., removing the coldest edge until a cycle is broken. However, detecting cycles isn't cheap. I'm using an MST-based approach which is faster and appears to deliver some performance wins.
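
A much-simplified illustration of the extra sorter (one plausible heuristic; the actual patch uses an MST-based approach on scc_iterator output): order the members of an SCC by the hotness of their profiled call edges so hot callsites are considered for inlining first.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: members of one SCC annotated with the weight of the
// hottest profiled call edge they participate in.
struct SCCMember {
  std::string Name;
  uint64_t HottestCallEdgeWeight;
};

// Reorder the SCC so functions involved in hotter call edges come first,
// regardless of how the Tarjan-based scc_iterator happened to reach them.
void sortSCCByCallsiteHotness(std::vector<SCCMember> &Members) {
  std::stable_sort(Members.begin(), Members.end(),
                   [](const SCCMember &A, const SCCMember &B) {
                     return A.HottestCallEdgeWeight > B.HottestCallEdgeWeight;
                   });
}
```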

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D114204


Revision tags: llvmorg-13.0.1-rc1
# 259e4c56 27-Oct-2021 Hongtao Yu <hoy@fb.com>

[CSSPGO] Trim cold base profiles for the CS preinliner.

Adding support to the CS preinliner to trim cold base profiles. This makes trimming consistent with the inline decision made by the preinliner. Also disables the existing profile merger when the preinliner is on, unless explicitly specified.

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D112489


Revision tags: llvmorg-13.0.0, llvmorg-13.0.0-rc4
# 446e2162 16-Sep-2021 Wenlei He <aktoon@gmail.com>

[llvm-profgen] Use context-sensitive byte size cost for preinliner decisions by default

Turn on `use-context-cost-for-preinliner` to use context-sensitive byte size cost for preinliner decisions by default.

This is a more accurate proxy of inline cost than profile size. Testing on our large workload showed that it delivers a measurable CPU improvement.

Differential Revision: https://reviews.llvm.org/D109893


Revision tags: llvmorg-13.0.0-rc3
# f10004e7 01-Sep-2021 Wenlei He <aktoon@gmail.com>

[CSSPGO] Add stats for pre-inliner

Add some stats to help tune the pre-inliner.

Differential Revision: https://reviews.llvm.org/D109098


# 7ca80300 30-Aug-2021 Hongtao Yu <hoy@fb.com>

[CSSPGO] Enable loading MD5 CS profile.

Adding compiler support for MD5 CS profiles based on the previous context split work D107299. An MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster.

There are a few conversions from real names to MD5 names made on the sample loader and context tracker side to get it to work.

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D108342


Revision tags: llvmorg-13.0.0-rc2
# b9db7036 25-Aug-2021 Hongtao Yu <hoy@fb.com>

[CSSPGO] Split context string to deduplicate function name used in the context.

Currently context strings contain a lot of duplicated function names, and that significantly increases the profile size. This change splits the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by an index into the name table, which significantly reduces the size consumed by the context.

A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings, which is time- and memory-consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previously prevalent profile map, which was keyed by `StringRef`, is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to use an `ArrayRef` to represent a full context for CS profiles. For non-CS profiles, it falls back to using `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For the compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only for debugging and printing.

When it comes to the profile format, nothing has changed in the text format, though internally a CS context is implemented as a vector. The extbinary format is only changed for CS profiles, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, with each element being an offset pointing into `SecNameTable`. All occurrences of contexts elsewhere are redirected to use offsets into `SecCSNameTable`.
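
A rough sketch of the split representation (hypothetical names; the real encoding is `SampleContext` plus the `SecCSNameTable`/`SecNameTable` sections): a context becomes a sequence of {name-index, offset, discriminator} frames over a shared name table instead of one long string.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: each context frame refers to the name table by index,
// so a function name that appears in many contexts is stored only once.
struct ContextFrame {
  uint32_t NameIndex;     // index into the shared name table
  uint32_t LineOffset;    // callsite offset within the caller
  uint32_t Discriminator; // distinguishes multiple callsites on one line
};

using Context = std::vector<ContextFrame>; // a full calling context

struct NameTable {
  std::vector<std::string> Names;

  uint32_t intern(const std::string &N) {
    for (uint32_t I = 0; I < Names.size(); ++I)
      if (Names[I] == N)
        return I; // duplicated names cost nothing extra per context
    Names.push_back(N);
    return uint32_t(Names.size() - 1);
  }
};

int main() {
  NameTable NT;
  // Roughly "main:3.1 @ foo:5 @ bar", with no name string repeated.
  Context Ctx = {{NT.intern("main"), 3, 1},
                 {NT.intern("foo"), 5, 0},
                 {NT.intern("bar"), 0, 0}};
  return Ctx.size() == 3 ? 0 : 1;
}
```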

Testing
This is a no-diff change in terms of code quality and profile content (for text profiles).

For our internal large service (aka ads), profile generation is cut in half, with a 20x smaller string-based extbinary format generated.

The compile time of ads is dropped by 25%.

Differential Revision: https://reviews.llvm.org/D107299


# eca03d27 17-Aug-2021 Wenlei He <aktoon@gmail.com>

[CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen

This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions.

To do this, BinarySizeContextTracker is introduced to track function byte size under different inline contexts during disassembling. In the preinliner, we can now query context byte size under the switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep the size of functions under different contexts (callee as parent, caller as child), and it can give the best/longest possible matching context size for a given input context.
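
A minimal sketch of the reverse-trie lookup described above (hypothetical structure, not the actual BinarySizeContextTracker): contexts are stored callee-first, so a query walks outward from the callee and returns the size recorded at the deepest node that still matches.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: a reverse trie keyed callee-first. Each node may hold
// the post-optimization byte size of the callee under that partial context.
struct SizeTrieNode {
  std::map<std::string, SizeTrieNode> Children; // caller frame -> child node
  uint64_t FuncSize = 0;
  bool HasSize = false;
};

// Record a size for a context given innermost (callee) frame first.
void addContextSize(SizeTrieNode &Root, const std::vector<std::string> &Ctx,
                    uint64_t Size) {
  SizeTrieNode *Node = &Root;
  for (const std::string &Frame : Ctx)
    Node = &Node->Children[Frame];
  Node->FuncSize = Size;
  Node->HasSize = true;
}

// Return the size under the longest matching (possibly shorter) context.
uint64_t queryContextSize(const SizeTrieNode &Root,
                          const std::vector<std::string> &Ctx) {
  const SizeTrieNode *Node = &Root;
  uint64_t Best = 0;
  for (const std::string &Frame : Ctx) {
    auto It = Node->Children.find(Frame);
    if (It == Node->Children.end())
      break;
    Node = &It->second;
    if (Node->HasSize)
      Best = Node->FuncSize; // deepest match seen so far wins
  }
  return Best;
}
```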

The new size cost is off by default. There're a few TODOs that need to be addressed: 1) avoid dangling strings from `Offset2LocStackMap`, which will be addressed in the split context work; 2) use the inlinee's entry probe to make sure we have a correct zero size for an inlinee that's completely optimized away after inlining. Some tuning is also needed.

Differential Revision: https://reviews.llvm.org/D108180


Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3
# c60f1d5d 16-Jun-2021 Hongtao Yu <hoy@fb.com>

[CSSPGO] Fix an invalid hash table reference issue in the CS preinliner.

We were using a `StringMap` object to store all profiles to be emitted. The object is basically an unordered hash table, therefore updating it while traversing it may cause issues since the underlying bucket array could change.
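
The underlying pitfall and the usual fix look roughly like this (illustrative `std::unordered_map` code, not the actual `StringMap` usage in llvm-profgen): snapshot the keys first, then mutate the table in a second pass.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: inserting into a hash table while iterating over it
// may rehash the bucket array and invalidate the iterator being used.
// Collect the keys up front, then update the table afterwards.
void updateProfiles(std::unordered_map<std::string, uint64_t> &Profiles) {
  std::vector<std::string> Keys;
  Keys.reserve(Profiles.size());
  for (const auto &KV : Profiles)
    Keys.push_back(KV.first);

  for (const std::string &K : Keys) {
    uint64_t Count = Profiles[K];      // read before inserting anything
    Profiles[K + ".base"] = Count / 2; // safe: no live iterator into Profiles
  }
}
```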

I'm also moving the `csspgo-preinliner` switch around so that no context trie will be constructed (by the constructor of `CSPreInliner`) when the switch is off.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D104267


Revision tags: llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1
# 00ef28ef 08-Apr-2021 Wenlei He <aktoon@gmail.com>

[CSSPGO] Fix dangling context strings and improve profile order consistency and error handling

This patch fixed the following issues along side with some refactoring:

1. Fix bugs where the StringRef for a context string outlives the underlying std::string. We now keep a string table in the profile generator to hold the std::strings (see the sketch after this list). We also do the same for bracketed context strings in the profile writer.
2. Make sure profile output strictly follows (total sample, name) order. Previously, there was an inconsistency between ProfileMap's key and FunctionSamples's name, leading to inconsistent ordering. This is now fixed by introducing context profile canonicalization. Assertions are also added to make sure ProfileMap's key and FunctionSamples's name are always consistent.
3. Enhanced error handling for profile writing to make sure we bubble up errors properly for both llvm-profgen and llvm-profdata when the string table is not populated correctly for the extended binary profile.
4. Keep all internal context representations bracket-free. This avoids creating new strings for context trimming, merging and preinlining. The getNameWithContext API is now simplified accordingly.
5. Factor out the code for context trimming and merging into SampleContextTrimmer in SampleProf.cpp. This enables llvm-profdata to use the trimmer when merging profiles. Changes in llvm-profgen will be in a separate patch.
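
A minimal sketch of the ownership fix in item 1 (illustrative only, using `std::string_view` in place of `StringRef`): the generator-owned table keeps the `std::string` storage alive for as long as the views are used.

```cpp
#include <deque>
#include <string>
#include <string_view>

// Hypothetical sketch: a string table that owns the std::string storage so
// that the lightweight views handed out never dangle. std::deque growth does
// not move existing elements, so previously returned views stay valid.
class ContextStringTable {
  std::deque<std::string> Storage;

public:
  std::string_view save(std::string S) {
    Storage.push_back(std::move(S));
    return Storage.back();
  }
};
```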

Differential Revision: https://reviews.llvm.org/D100090


Revision tags: llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4
# 3e3fc431 29-Mar-2021 Hongtao Yu <hoy@fb.com>

[CSSPGO] Top-down processing order based on full profile.

Use profiled call edges to augment the top-down order. There are cases where the top-down order computed based on the static call graph doesn't reflect the real execution order. For example:

1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets by allowing the caller to be processed before them.

2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus may cause potential inlining to be overlooked. The function order in one SCC is being adjusted to a top-down order based on the profile to favor more inlining.

3. Transitive indirect call edges due to inlining. When a callee function is inlined into a caller function in LTO prelink, every call edge originating from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink since the inlined callee is gone from the static call graph.

4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to the ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless from a local file scope point of view. More importantly, the inlinee, once fully inlined into a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function, which may be called from external modules. While this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the surviving copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.

Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed, to handle case #3 and case #4.
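
Schematically, such a graph can be built from nothing but the profile (hypothetical structure, not the actual ProfiledCallGraph): every caller/callee pair observed in the profile becomes an edge, whether or not the callee is reachable in the static call graph.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical sketch: a call graph built purely from profile data. Function
// names are sufficient; no IR Function objects are required, which is what
// makes the header/ODR and inlined-indirect-call cases above tractable.
struct ProfiledCallGraphSketch {
  std::map<std::string, std::set<std::string>> Edges; // caller -> callees

  void addProfiledCall(const std::string &Caller, const std::string &Callee) {
    Edges[Caller].insert(Callee); // direct or indirect, as seen in the profile
  }
};
```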

I'm seeing an average 0.4% perf win on SPEC2017. For certain benchmarks such as Xalancbmk and GCC, the win is bigger, above 2%.

The change is an enhancement to https://reviews.llvm.org/D95988.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D99351
