History log of /llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp (Results 1 – 25 of 34)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1
# fbec1c2a 26-Sep-2024 Ellis Hoag <ellis.sparky.hoag@gmail.com>

[NFC][CodeLayout] Remove unused parameter (#110145)

The `NodeCounts` parameter of `calcExtTspScore()` is unused, so remove
it.
Use `SmallVector` since arrays are expected to be small since they
r

[NFC][CodeLayout] Remove unused parameter (#110145)

The `NodeCounts` parameter of `calcExtTspScore()` is unused, so remove
it.
Use `SmallVector` since arrays are expected to be small since they
represent MBBs.

show more ...


Revision tags: llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init
# fef144ce 25-Jun-2024 Kazu Hirata <kazu@google.com>

Revert "[llvm] Use llvm::sort (NFC) (#96434)"

This reverts commit 05d167fc201b4f2e96108be0d682f6800a70c23d.

Reverting the patch fixes the following under EXPENSIVE_CHECKS:

LLVM :: CodeGen/AMDGPU

Revert "[llvm] Use llvm::sort (NFC) (#96434)"

This reverts commit 05d167fc201b4f2e96108be0d682f6800a70c23d.

Reverting the patch fixes the following under EXPENSIVE_CHECKS:

LLVM :: CodeGen/AMDGPU/sched-group-barrier-pipeline-solver.mir
LLVM :: CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir
LLVM :: CodeGen/PowerPC/aix-xcoff-used-with-stringpool.ll
LLVM :: CodeGen/PowerPC/merge-string-used-by-metadata.mir
LLVM :: CodeGen/PowerPC/mergeable-string-pool-large.ll
LLVM :: CodeGen/PowerPC/mergeable-string-pool-pass-only.mir
LLVM :: CodeGen/PowerPC/mergeable-string-pool.ll

show more ...


# 05d167fc 23-Jun-2024 Kazu Hirata <kazu@google.com>

[llvm] Use llvm::sort (NFC) (#96434)


Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5
# cebc8379 02-Nov-2023 spupyrev <spupyrev@users.noreply.github.com>

[CodeLayout] Pre-process execution counts before layout (#70501)

BOLT fails to process binaries in non-LBR mode, as some blocks marked as
having
a zero execution count. Adjusting code layout to proc

[CodeLayout] Pre-process execution counts before layout (#70501)

BOLT fails to process binaries in non-LBR mode, as some blocks marked as
having
a zero execution count. Adjusting code layout to process such blocks
without
assertions. This is NFC for all other use cases.

show more ...


Revision tags: llvmorg-17.0.4
# f61179f8 27-Oct-2023 spupyrev <spupyrev@users.noreply.github.com>

[CodeLayout] Changed option names cds to cdsort (#69668)

Renaming cds-> cdsort for consistency. This is NFC unless somebody uses
older names


# d0584e24 25-Oct-2023 Alina Sbirlea <asbirlea@google.com>

[CodeLayout] Update to resolve Wdangling warning.

Change cc2fbc648d7babbfa612f4f5eda3160212ef6ca7 introduced -Wdangling
warning, use temporaries to resolve.

llvm/lib/Transforms/Utils/CodeLayout.cpp

[CodeLayout] Update to resolve Wdangling warning.

Change cc2fbc648d7babbfa612f4f5eda3160212ef6ca7 introduced -Wdangling
warning, use temporaries to resolve.

llvm/lib/Transforms/Utils/CodeLayout.cpp:764:27: error: temporary whose address is used as value of local variable '[minDensity, maxDensity]' will be destroyed at the end of the full-expression [-Werror,-Wdangling]
764 | std::minmax(ChainPred->density(), ChainSucc->density());

llvm/lib/Transforms/Utils/CodeLayout.cpp:764:49: error: temporary whose address is used as value of local variable '[minDensity, maxDensity]' will be destroyed at the end of the full-expression [-Werror,-Wdangling]
764 | std::minmax(ChainPred->density(), ChainSucc->density());

show more ...


# cc2fbc64 25-Oct-2023 spupyrev <spupyrev@users.noreply.github.com>

[CodeLayout] Faster basic block reordering, ext-tsp (#68617)

Aggressive inlining might produce huge functions with >10K of basic
blocks. Since BFI treats _all_ blocks and jumps as "hot" having
non

[CodeLayout] Faster basic block reordering, ext-tsp (#68617)

Aggressive inlining might produce huge functions with >10K of basic
blocks. Since BFI treats _all_ blocks and jumps as "hot" having
non-negative (but perhaps small) weight, the current implementation can
be slow, taking minutes to produce an layout. This change introduces a
few modifications that significantly (up to 50x on some instances)
speeds up the computation. Some notable changes:
- reduced the maximum chain size to 512 (from the prior 4096);
- introduced MaxMergeDensityRatio param to avoid merging chains with
very different densities;
- dropped a couple of params that seem unnecessary.

Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure perf
difference for such small changes. I did test the performance clang-14
binary and do not record a perf or i-cache-related differences.

My 5 benchmarks, with ext-tsp runtime (the lower the better) and
"tsp-score" (the higher the better).
**Before**:

- benchmark 1:
num functions: 13,047
reordering running time is 2.4 seconds
score: 125503458 (128.3102%)
- benchmark 2:
num functions: 16,438
reordering running time is 3.4 seconds
score: 12613997277 (129.7495%)
- benchmark 3:
num functions: 12,359
reordering running time is 1.9 seconds
score: 1315881613 (105.8991%)
- benchmark 4:
num functions: 96,588
reordering running time is 7.3 seconds
score: 89513906284 (100.3413%)
- benchmark 5:
num functions: 1
reordering running time is 372 seconds
score: 21292505965077 (99.9979%)
- benchmark 6:
num functions: 71,155
reordering running time is 314 seconds
score: 29795381626270671437824 (102.7519%)

**After**:
- benchmark 1:
reordering running time is 2.2 seconds
score: 125510418 (128.3130%)

- benchmark 2:
reordering running time is 2.6 seconds
score: 12614502162 (129.7525%)

- benchmark 3:
reordering running time is 1.6 seconds
score: 1315938168 (105.9024%)

- benchmark 4:
reordering running time is 4.9 seconds
score: 89518095837 (100.3454%)

- benchmark 5:
reordering running time is 4.8 seconds
score: 21292295939119 (99.9971%)

- benchmark 6:
reordering running time is 104 seconds
score: 29796710925310302879744 (102.7565%)

show more ...


# f9306f6d 25-Oct-2023 Kazu Hirata <kazu@google.com>

[ADT] Rename llvm::erase_value to llvm::erase (NFC) (#70156)

C++20 comes with std::erase to erase a value from std::vector. This
patch renames llvm::erase_value to llvm::erase for consistency with

[ADT] Rename llvm::erase_value to llvm::erase (NFC) (#70156)

C++20 comes with std::erase to erase a value from std::vector. This
patch renames llvm::erase_value to llvm::erase for consistency with
C++20.

We could make llvm::erase more similar to std::erase by having it
return the number of elements removed, but I'm not doing that for now
because nobody seems to care about that in our code base.

Since there are only 50 occurrences of erase_value in our code base,
this patch replaces all of them with llvm::erase and deprecates
llvm::erase_value.

show more ...


# a2441837 22-Oct-2023 Fangrui Song <i@maskray.me>

[CodeLayout] cache-directed sort: limit max chain size (#69039)

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).

[CodeLayout] cache-directed sort: limit max chain size (#69039)

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).
```
4.6% 20.7Mi .text.hot
3.5% 15.9Mi .text
3.4% 15.2Mi .text.unknown
```

Add cl option `cdsort-max-chain-size`, which is similar to
`ext-tsp-max-chain-size`, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort
--time-trace"
builds, the "Total Sort sections" time is measured as follows:

* -mllvm -cdsort-max-chain-size=64: 1.321813
* -mllvm -cdsort-max-chain-size=128: 2.030425
* -mllvm -cdsort-max-chain-size=256: 2.927684
* -mllvm -cdsort-max-chain-size=512: 5.493106
* unlimited: 9 minutes

The rest part takes 6.8s.

show more ...


# beffc821 18-Oct-2023 Fangrui Song <i@maskray.me>

[CodeLayout] CDSortImpl: remove HotChains and remove linear-time erase_value from mergeChains (#69276)

After mergeChainPairs initializes a priority queue, HotChains is unused
except a HotChains.siz

[CodeLayout] CDSortImpl: remove HotChains and remove linear-time erase_value from mergeChains (#69276)

After mergeChainPairs initializes a priority queue, HotChains is unused
except a HotChains.size() use in LLVM_DEBUG. Optimize it out.

show more ...


Revision tags: llvmorg-17.0.3
# 9dbfd582 13-Oct-2023 Fangrui Song <i@maskray.me>

[CodeLayout] CDSortImpl: remove two conditions that cannot trigger. NFC


# 3d75c7c1 13-Oct-2023 spupyrev <spupyrev@users.noreply.github.com>

[CodeLayout] Fixing initialization of empty ranges (#68917)

Fixing libc++'s consistency checks, by eliminating ranges of singular
iterators.


# b90fcafc 12-Oct-2023 spupyrev <spupyrev@users.noreply.github.com>

[CodeLayout][NFC] Using MergedVector to avoid extra vector allocations (#68724)

Using a wrapper (MergedVector) around vectors to avoid extra vector
allocations.
Plus a few edits in the comments.


# 5b39d8d3 06-Oct-2023 spupyrev <spupyrev@fb.com>

Revert "[CodeLayout] Faster basic block reordering, ext-tsp (#68275)"

This reverts commit 0a7bf3aad692c5bb591cac605a19980b00325d50.


# 0a7bf3aa 06-Oct-2023 spupyrev <spupyrev@users.noreply.github.com>

[CodeLayout] Faster basic block reordering, ext-tsp (#68275)

Aggressive inlining might produce huge functions with >10K of basic
blocks. Since BFI treats _all_ blocks and jumps as "hot" having
non

[CodeLayout] Faster basic block reordering, ext-tsp (#68275)

Aggressive inlining might produce huge functions with >10K of basic
blocks. Since BFI treats _all_ blocks and jumps as "hot" having
non-negative (but perhaps small) weight, the current implementation can
be slow, taking minutes to produce an layout. This change introduces a
few modifications that significantly (up to 50x on some instances)
speeds up the computation. Some notable changes:
- reduced the maximum chain size to 512 (from the prior 4096);
- introeuced MaxMergeDensityRatio param to avoid merging chains with
very differen densities;
- dropped a couple of params that seem unnecessary.

Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure perf
difference for such small changes. I did test the performance clang-14
binary and do not record a perf or i-cache-related differences.

My 5 benchmarks, with ext-tsp runtime (the lower the better) and
"tsp-score" (the higher the better).
**Before**:

- benchmark 1:
reordering running time is 2486 milliseconds
score: 125503458 (128.3102%)
- benchmark 2:
reordering running time is 3443 milliseconds
score: 12613997277 (129.7495%)
- benchmark 2:
reordering running time is 1978 milliseconds
score: 1315881613 (105.8991%)
- benchmark 4:
reordering running time is 7364 milliseconds
score: 89513906284 (100.3413%)
- benchmark 5:
reordering running time is 372605 milliseconds
score: 21292505965077 (99.9979%)

**After**:
- benchmark 1:
reordering running time is 2498 milliseconds
score: 125510418 (128.3173%)

- benchmark 2:
reordering running time is 3201 milliseconds
score: 12614502162 (129.7547%)

- benchmark 3:
reordering running time is 2137 milliseconds
score: 1315938168 (105.9036%)

- benchmark 4:
reordering running time is 6242 milliseconds
score: 89518095837 (100.3460%)

- benchmark 5:
reordering running time is 5819 milliseconds
score: 21292295939119 (99.9969%)

show more ...


Revision tags: llvmorg-17.0.2
# 6b8d04c2 21-Sep-2023 Fangrui Song <i@maskray.me>

[CodeLayout] Refactor std::vector uses, namespace, and EdgeCountT. NFC

* Place types and functions in the llvm::codelayout namespace
* Change EdgeCountT from pair<pair<uint64_t, uint64_t>, uint64_t>

[CodeLayout] Refactor std::vector uses, namespace, and EdgeCountT. NFC

* Place types and functions in the llvm::codelayout namespace
* Change EdgeCountT from pair<pair<uint64_t, uint64_t>, uint64_t> to a struct and utilize structured bindings.
It is not conventional to use the "T" suffix for structure types.
* Remove a redundant copy in ChainT::merge.
* Change {ExtTSPImpl,CDSortImpl}::run to use return value instead of an output parameter
* Rename applyCDSLayout to computeCacheDirectedLayout: (a) avoid rare
abbreviation "CDS" (cache-directed sort) (b) "compute" is more conventional
for the specific use case
* Change the parameter types from std::vector to ArrayRef so that
SmallVector arguments can be used.
* Similarly, rename applyExtTspLayout to computeExtTspLayout.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D159526

show more ...


Revision tags: llvmorg-17.0.1
# af935cf0 19-Sep-2023 Fangrui Song <i@maskray.me>

[CodeLayout] Fix X1_Y_X2 and Y_X2_X1 testing for jumps from Y (#66592)

The CHECK2 test in code_placement_ext_tsp_large.ll now has the same
result as
the CHECK test: when chain(0,2,3,4,1) is merged

[CodeLayout] Fix X1_Y_X2 and Y_X2_X1 testing for jumps from Y (#66592)

The CHECK2 test in code_placement_ext_tsp_large.ll now has the same
result as
the CHECK test: when chain(0,2,3,4,1) is merged with chain(8), the
result is now
chain(0,2,3,4,8,1).

Ideally we should have test coverage for
-ext-tsp-chain-split-threshold=1, but
it seems challenging to craft one. Perhaps the default value of
-ext-tsp-chain-split-threshold can be decreased as the
-ext-tsp-enable-chain-split-along-jumps heuristic is now more powerful.

show more ...


Revision tags: llvmorg-17.0.0, llvmorg-17.0.0-rc4
# b4b42bd6 25-Aug-2023 spupyrev <spupyrev@fb.com>

Cleaning up unreachable code in CodeLayout

- removing an unreachable instruction from the code (earlier code merge bug);
- silencing "unused variable" warnings.

Reviewed By: rahmanl

Differential R

Cleaning up unreachable code in CodeLayout

- removing an unreachable instruction from the code (earlier code merge bug);
- silencing "unused variable" warnings.

Reviewed By: rahmanl

Differential Revision: https://reviews.llvm.org/D158859

show more ...


Revision tags: llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init
# bc59faa8 13-Jun-2023 spupyrev <spupyrev@fb.com>

A new code layout algorithm for function reordering [2/3]

We are bringing a new algorithm for function layout (reordering) based on the
call graph (extracted from a profile data). The algorithm is a

A new code layout algorithm for function reordering [2/3]

We are bringing a new algorithm for function layout (reordering) based on the
call graph (extracted from a profile data). The algorithm is an improvement of
top of a known heuristic, C^3. It tries to co-locate hot and frequently executed
together functions in the resulting ordering. Unlike C^3, it explores a larger
search space and have an objective closely tied to the performance of
instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort.
The algorithm can be used at the linking or post-linking (e.g., BOLT) stage.

The algorithm shares some similarities with C^3 and an approach for basic block
reordering (ext-tsp). It works with chains (ordered lists)
of functions. Initially all chains are isolated functions. On every iteration,
we pick a pair of chains whose merging yields the biggest increase in the
objective, which is a weighted combination of frequency-based and distance-based
locality. That is, we try to co-locate hot functions together (so they can share
the cache lines) and functions frequently executed together. The merging process
stops when there is only one chain left, or when merging does not improve the
objective. In the latter case, the remaining chains are sorted by density in the
decreasing order.

**Complexity**
We regularly apply the algorithm for large data-center binaries containing 10K+
(hot) functions, and the algorithm takes only a few seconds. For some extreme
cases with 100K-1M nodes, the runtime is within minutes.

**Perf-impact**
We extensively tested the implementation extensively on a benchmark of isolated
binaries and prod services. The impact is measurable for "larger" binaries that
are front-end bound: the cpu time improvement (on top of C^3) is in the range
of [0% .. 1%], which is a result of a reduced i-TLB miss rate (by up to 20%) and
i-cache miss rate (up to 5%).

Reviewed By: rahmanl

Differential Revision: https://reviews.llvm.org/D152834

show more ...


# a7e13a99 13-Jun-2023 spupyrev <spupyrev@fb.com>

A new code layout algorithm for function reordering [1/3]

We are brining a new algorithm for function layout (reordering) based on the
call graph (extracted from a profile data). The algorithm is an

A new code layout algorithm for function reordering [1/3]

We are brining a new algorithm for function layout (reordering) based on the
call graph (extracted from a profile data). The algorithm is an improvement of
top of a known heuristic, C^3. It tries to co-locate hot and frequently executed
together functions in the resulting ordering. Unlike C^3, it explores a larger
search space and have an objective closely tied to the performance of
instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort.
The algorithm can be used at the linking or post-linking (e.g., BOLT) stage.

This diff modifies the existing data structures to facilitate the implementation
(down the stack). This is a no-op change.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D152833

show more ...


Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3
# 1e692113 14-Feb-2023 Fangrui Song <i@maskray.me>

Move global namespace cl::opt inside llvm::


Revision tags: llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2
# e3518210 03-Oct-2022 Yevgeny Rouban <yrouban@azul.com>

Fix compilation of CodeLayout.cpp for MacOS

llvm/lib/Transforms/Utils/CodeLayout.cpp uses std::abs() with double argument,
which is provided by cmath header, which is not explicitly included into Co

Fix compilation of CodeLayout.cpp for MacOS

llvm/lib/Transforms/Utils/CodeLayout.cpp uses std::abs() with double argument,
which is provided by cmath header, which is not explicitly included into CodeLayout.cpp.
The implicit include in llvm/include/llvm/Support/MathExtras.h was removed in
commit 16544cbe64b81a50800a88296ef37f4873a37b25

Inserting explicit include of cmath into CodeLayout.cpp in order to fix build on MacOS.

Committed on behalf of alsemenov (Aleksei Semenov)
Reviewed By: thieta
Differential Revision: https://reviews.llvm.org/D135072

show more ...


Revision tags: llvmorg-15.0.1, llvmorg-15.0.0
# 56ea4f9b 28-Aug-2022 Kazu Hirata <kazu@google.com>

[Transforms] Qualify auto in range-based for loops (NFC)

Identified with readability-qualified-auto.


Revision tags: llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init
# 8d5b694d 15-Jul-2022 spupyrev <spupyrev@fb.com>

extending code layout alg

The diff modifies ext-tsp code layout algorithm in the following ways:
(i) fixes merging of cold block chains (this is a port of D129397);
(ii) adjusts the cost model utili

extending code layout alg

The diff modifies ext-tsp code layout algorithm in the following ways:
(i) fixes merging of cold block chains (this is a port of D129397);
(ii) adjusts the cost model utilized for optimization;
(iii) adjusts some APIs so that the implementation can be used in BOLT; this is
a prerequisite for D129895.

The only non-trivial change is (ii). Here we introduce different weights for
conditional and unconditional branches in the cost model. Based on the new model
it is slightly more important to increase the number of "fall-through
unconditional" jumps, which makes sense, as placing two blocks with an
unconditional jump next to each other reduces the number of jump instructions in
the generated code. Experimentally, this makes a mild impact on the performance;
I've seen up to 0.2%-0.3% perf win on some benchmarks.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D129893

show more ...


# 50724716 14-Aug-2022 Kazu Hirata <kazu@google.com>

[Transforms] Qualify auto in range-based for loops (NFC)

Identified with readability-qualified-auto.


12