|
Revision tags: llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1 |
|
| #
fbec1c2a |
| 26-Sep-2024 |
Ellis Hoag <ellis.sparky.hoag@gmail.com> |
[NFC][CodeLayout] Remove unused parameter (#110145)
The `NodeCounts` parameter of `calcExtTspScore()` is unused, so remove
it.
Use `SmallVector` since arrays are expected to be small since they
r
[NFC][CodeLayout] Remove unused parameter (#110145)
The `NodeCounts` parameter of `calcExtTspScore()` is unused, so remove
it.
Use `SmallVector` since arrays are expected to be small since they
represent MBBs.
show more ...
|
|
Revision tags: llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5 |
|
| #
d4360e42 |
| 11-Nov-2023 |
Kazu Hirata <kazu@google.com> |
[llvm] Stop including llvm/ADT/DenseMap.h (NFC)
Ientified with clangd.
|
|
Revision tags: llvmorg-17.0.4 |
|
| #
a2441837 |
| 22-Oct-2023 |
Fangrui Song <i@maskray.me> |
[CodeLayout] cache-directed sort: limit max chain size (#69039)
When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).
[CodeLayout] cache-directed sort: limit max chain size (#69039)
When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).
```
4.6% 20.7Mi .text.hot
3.5% 15.9Mi .text
3.4% 15.2Mi .text.unknown
```
Add cl option `cdsort-max-chain-size`, which is similar to
`ext-tsp-max-chain-size`, and set it to 128, to improve performance.
In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort
--time-trace"
builds, the "Total Sort sections" time is measured as follows:
* -mllvm -cdsort-max-chain-size=64: 1.321813
* -mllvm -cdsort-max-chain-size=128: 2.030425
* -mllvm -cdsort-max-chain-size=256: 2.927684
* -mllvm -cdsort-max-chain-size=512: 5.493106
* unlimited: 9 minutes
The rest part takes 6.8s.
show more ...
|
|
Revision tags: llvmorg-17.0.3, llvmorg-17.0.2 |
|
| #
6b8d04c2 |
| 21-Sep-2023 |
Fangrui Song <i@maskray.me> |
[CodeLayout] Refactor std::vector uses, namespace, and EdgeCountT. NFC
* Place types and functions in the llvm::codelayout namespace * Change EdgeCountT from pair<pair<uint64_t, uint64_t>, uint64_t>
[CodeLayout] Refactor std::vector uses, namespace, and EdgeCountT. NFC
* Place types and functions in the llvm::codelayout namespace * Change EdgeCountT from pair<pair<uint64_t, uint64_t>, uint64_t> to a struct and utilize structured bindings. It is not conventional to use the "T" suffix for structure types. * Remove a redundant copy in ChainT::merge. * Change {ExtTSPImpl,CDSortImpl}::run to use return value instead of an output parameter * Rename applyCDSLayout to computeCacheDirectedLayout: (a) avoid rare abbreviation "CDS" (cache-directed sort) (b) "compute" is more conventional for the specific use case * Change the parameter types from std::vector to ArrayRef so that SmallVector arguments can be used. * Similarly, rename applyExtTspLayout to computeExtTspLayout.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D159526
show more ...
|
|
Revision tags: llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init |
|
| #
bc59faa8 |
| 13-Jun-2023 |
spupyrev <spupyrev@fb.com> |
A new code layout algorithm for function reordering [2/3]
We are bringing a new algorithm for function layout (reordering) based on the call graph (extracted from a profile data). The algorithm is a
A new code layout algorithm for function reordering [2/3]
We are bringing a new algorithm for function layout (reordering) based on the call graph (extracted from a profile data). The algorithm is an improvement of top of a known heuristic, C^3. It tries to co-locate hot and frequently executed together functions in the resulting ordering. Unlike C^3, it explores a larger search space and have an objective closely tied to the performance of instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort. The algorithm can be used at the linking or post-linking (e.g., BOLT) stage.
The algorithm shares some similarities with C^3 and an approach for basic block reordering (ext-tsp). It works with chains (ordered lists) of functions. Initially all chains are isolated functions. On every iteration, we pick a pair of chains whose merging yields the biggest increase in the objective, which is a weighted combination of frequency-based and distance-based locality. That is, we try to co-locate hot functions together (so they can share the cache lines) and functions frequently executed together. The merging process stops when there is only one chain left, or when merging does not improve the objective. In the latter case, the remaining chains are sorted by density in the decreasing order.
**Complexity** We regularly apply the algorithm for large data-center binaries containing 10K+ (hot) functions, and the algorithm takes only a few seconds. For some extreme cases with 100K-1M nodes, the runtime is within minutes.
**Perf-impact** We extensively tested the implementation extensively on a benchmark of isolated binaries and prod services. The impact is measurable for "larger" binaries that are front-end bound: the cpu time improvement (on top of C^3) is in the range of [0% .. 1%], which is a result of a reduced i-TLB miss rate (by up to 20%) and i-cache miss rate (up to 5%).
Reviewed By: rahmanl
Differential Revision: https://reviews.llvm.org/D152834
show more ...
|
|
Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init |
|
| #
8d5b694d |
| 15-Jul-2022 |
spupyrev <spupyrev@fb.com> |
extending code layout alg
The diff modifies ext-tsp code layout algorithm in the following ways: (i) fixes merging of cold block chains (this is a port of D129397); (ii) adjusts the cost model utili
extending code layout alg
The diff modifies ext-tsp code layout algorithm in the following ways: (i) fixes merging of cold block chains (this is a port of D129397); (ii) adjusts the cost model utilized for optimization; (iii) adjusts some APIs so that the implementation can be used in BOLT; this is a prerequisite for D129895.
The only non-trivial change is (ii). Here we introduce different weights for conditional and unconditional branches in the cost model. Based on the new model it is slightly more important to increase the number of "fall-through unconditional" jumps, which makes sense, as placing two blocks with an unconditional jump next to each other reduces the number of jump instructions in the generated code. Experimentally, this makes a mild impact on the performance; I've seen up to 0.2%-0.3% perf win on some benchmarks.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D129893
show more ...
|
|
Revision tags: llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2 |
|
| #
f85c91f1 |
| 01-Jan-2022 |
Kazu Hirata <kazu@google.com> |
[Transforms] Remove unused forward declarations (NFC)
|
|
Revision tags: llvmorg-13.0.1-rc1 |
|
| #
f573f686 |
| 08-Nov-2021 |
spupyrev <spupyrev@fb.com> |
ext-tsp basic block layout
A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG optimizing jump locality
ext-tsp basic block layout
A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG optimizing jump locality and thus processor I-cache utilization. This is achieved via increasing the number of fall-through jumps and co-locating frequently executed nodes together. The name follows the underlying optimization problem, Extended-TSP, which is a generalization of classical (maximum) Traveling Salesmen Problem.
The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks. Initially all chains are isolated basic blocks. On every iteration, we pick a pair of chains whose merging yields the biggest increase in the ExtTSP value, which models how i-cache "friendly" a specific chain is. A pair of chains giving the maximum gain is merged into a new chain. The procedure stops when there is only one chain left, or when merging does not increase ExtTSP. In the latter case, the remaining chains are sorted by density in decreasing order.
An important aspect is the way two chains are merged. Unlike earlier algorithms (e.g., based on the approach of Pettis-Hansen), two chains, X and Y, are first split into three, X1, X2, and Y. Then we consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score. This improves the quality of the final result (the search space is larger) while keeping the implementation sufficiently fast.
Differential Revision: https://reviews.llvm.org/D113424
show more ...
|
| #
c68f71eb |
| 08-Nov-2021 |
spupyrev <spupyrev@fb.com> |
ext-tsp basic block layout
A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG optimizing jump locality
ext-tsp basic block layout
A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG optimizing jump locality and thus processor I-cache utilization. This is achieved via increasing the number of fall-through jumps and co-locating frequently executed nodes together. The name follows the underlying optimization problem, Extended-TSP, which is a generalization of classical (maximum) Traveling Salesmen Problem.
The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks. Initially all chains are isolated basic blocks. On every iteration, we pick a pair of chains whose merging yields the biggest increase in the ExtTSP value, which models how i-cache "friendly" a specific chain is. A pair of chains giving the maximum gain is merged into a new chain. The procedure stops when there is only one chain left, or when merging does not increase ExtTSP. In the latter case, the remaining chains are sorted by density in decreasing order.
An important aspect is the way two chains are merged. Unlike earlier algorithms (e.g., based on the approach of Pettis-Hansen), two chains, X and Y, are first split into three, X1, X2, and Y. Then we consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score. This improves the quality of the final result (the search space is larger) while keeping the implementation sufficiently fast.
Differential Revision: https://reviews.llvm.org/D113424
show more ...
|