Revision tags: llvmorg-21-init, llvmorg-19.1.7

# eeac0ffa | 10-Jan-2025 | Nikita Popov <npopov@redhat.com>
Revert "[MachineLICM] Use `RegisterClassInfo::getRegPressureSetLimit` (#119826)"
This reverts commit b4e17d4a314ed87ff6b40b4b05397d4b25b6636a.
This causes a large compile-time regression.

# b4e17d4a | 09-Jan-2025 | Pengcheng Wang <wangpengcheng.pp@bytedance.com>
[MachineLICM] Use `RegisterClassInfo::getRegPressureSetLimit` (#119826)
`RegisterClassInfo::getRegPressureSetLimit` is a wrapper around `TargetRegisterInfo::getRegPressureSetLimit` with some logic to adjust the limit by removing reserved registers.
It seems that we shouldn't use `TargetRegisterInfo::getRegPressureSetLimit` directly, as the comment "This limit must be adjusted dynamically for reserved registers" says.
Separate from https://github.com/llvm/llvm-project/pull/118787
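
A minimal toy sketch of the distinction the patch relies on (names and structure are illustrative, not LLVM's actual `RegisterClassInfo` code): the statically computed limit of a pressure set must be lowered by the reserved registers that belong to it, or a pass like MachineLICM overestimates how many registers it may use.

```cpp
#include <vector>

// Toy model: adjust a static pressure-set limit for reserved registers.
struct PressureSet {
  unsigned RawLimit;                // static limit from the target tables
  std::vector<bool> MemberReserved; // per-member "reserved" flag
};

unsigned adjustedLimit(const PressureSet &PS) {
  unsigned Reserved = 0;
  for (bool IsReserved : PS.MemberReserved)
    Reserved += IsReserved;         // reserved members still occupy the set
  // Dynamic limit = static limit minus registers that are never allocatable.
  return PS.RawLimit > Reserved ? PS.RawLimit - Reserved : 0;
}
```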

Revision tags: llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4

# 6548b635 | 09-Nov-2024 | Shilei Tian <i@tianshilei.me>
Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.

# ca33649a | 08-Nov-2024 | Shilei Tian <i@tianshilei.me>
Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.

# e215a1e2 | 08-Nov-2024 | Shilei Tian <i@tianshilei.me>
[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)

Revision tags: llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4

# 5f6172f0 | 22-Aug-2024 | Sameer Sahasrabuddhe <sameer.sahasrabuddhe@amd.com>
[Transforms] Refactor CreateControlFlowHub (#103013)
CreateControlFlowHub is a method that redirects control flow edges from a set of incoming blocks to a set of outgoing blocks through a new set of "guard" blocks. This is now refactored into a separate file with one enhancement: The input to the method is now a set of branches rather than two sets of blocks.
The original implementation reroutes every edge from incoming blocks to outgoing blocks. But it is possible that for some incoming block InBB, some successor S might be in the set of outgoing blocks, but that particular edge should not be rerouted. The new implementation makes this possible by allowing the user to specify the targets of each branch that need to be rerouted.
This is needed when improving the implementation of FixIrreducible #101386. Current use in FixIrreducible does not demonstrate this finer control over the edges being rerouted. But in UnifyLoopExits, when only one successor of an exiting block is an exit block, this refinement now reroutes only the relevant control-flow through the edge; the non-exit successor is not rerouted. This results in fewer branches and PHI nodes in the hub.
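
A toy sketch of the interface change (identifiers are illustrative, not the actual LLVM API): because the caller now names explicit branch targets instead of two block sets, an edge from an incoming block to an outgoing successor can be left intact rather than rerouted unconditionally.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy CFG edge; the real code works on branch instructions and their
// successors rather than block names.
struct Edge { std::string From, To; };

// Old behavior: reroute every Incoming -> Outgoing edge through the hub.
// New behavior: reroute only the edges the caller explicitly selected.
std::vector<Edge>
edgesToReroute(const std::vector<Edge> &AllEdges,
               const std::set<std::pair<std::string, std::string>> &Selected) {
  std::vector<Edge> Reroute;
  for (const Edge &E : AllEdges)
    if (Selected.count({E.From, E.To})) // per-edge control, not per-block
      Reroute.push_back(E);             // this edge goes through the hub
  return Reroute;
}
```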

Revision tags: llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init

# b1bcb7ca | 15-Jul-2024 | Matt Arsenault <Matthew.Arsenault@amd.com>
Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commit adaff46d087799072438dd744b038e6fd50a2d78.
Drop the -O3 checks from default-attributes.hip. I don't know why they are different on some bots but reverting this is far too disruptive.

# adaff46d | 15-Jul-2024 | dyung <douglas.yung@sony.com>
Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and
78bc1b64a6dc3fb6191355a5e1b502be8b3668e7.
The test CodeGenHIP/default-attributes.hip is failing on multiple bots
even after the attempted fix, including the following:
- https://lab.llvm.org/buildbot/#/builders/3/builds/1473
- https://lab.llvm.org/buildbot/#/builders/65/builds/1380
- https://lab.llvm.org/buildbot/#/builders/161/builds/595
- https://lab.llvm.org/buildbot/#/builders/154/builds/1372
- https://lab.llvm.org/buildbot/#/builders/133/builds/1547
- https://lab.llvm.org/buildbot/#/builders/81/builds/755
- https://lab.llvm.org/buildbot/#/builders/40/builds/570
- https://lab.llvm.org/buildbot/#/builders/13/builds/748
- https://lab.llvm.org/buildbot/#/builders/12/builds/1845
- https://lab.llvm.org/buildbot/#/builders/11/builds/1695
- https://lab.llvm.org/buildbot/#/builders/190/builds/1829
- https://lab.llvm.org/buildbot/#/builders/193/builds/962
- https://lab.llvm.org/buildbot/#/builders/23/builds/991
- https://lab.llvm.org/buildbot/#/builders/144/builds/2256
- https://lab.llvm.org/buildbot/#/builders/46/builds/1614
These bots have been broken for a day, so reverting to get everything
back to green.

# 78bc1b64 | 14-Jul-2024 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Move attributor into optimization pipeline (#83131)
Removing it from the codegen pipeline induces a lot of test churn
because llc is no longer optimizing out implicit arguments to kernels.
Mostly mechanical, but there are some creative test updates. I preferred
to take the changes as-is in tests where the ABI isn't relevant. In
cases where it's more relevant, or the optimize-out logic was too
ingrained in the test, I pre-run the optimization. Some cases manually
add attributes to disable inputs.

Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init

# ff68e43c | 20-Jul-2023 | Jingu Kang <jingu.kang@arm.com>
[MachineLICM] Handle Subloops
This is a re-commit of the reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500.
Following the discussion on https://reviews.llvm.org/D154205, make the MachineLICM pass handle subloops while visiting the outermost loop's blocks only once.
Differential Revision: https://reviews.llvm.org/D154205

Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4

# 2be0abb7 | 04-May-2023 | Justin Lebar <justin.lebar@gmail.com>
Rewrite load-store-vectorizer.
The motivation for this change is a workload generated by the XLA compiler targeting nvidia GPUs.
This kernel has a few hundred i8 loads and stores. Merging is critical for performance.
The current LSV doesn't merge these well because it only considers instructions within a block of 64 loads+stores. This limit is necessary to contain the O(n^2) behavior of the pass. I'm hesitant to increase the limit, because this pass is already one of the slowest parts of compiling an XLA program.
So we rewrite basically the whole thing to use a new algorithm. Before, we compared every load/store to every other to see if they're consecutive. The insight (from tra@) is that this is redundant. If we know the offset from PtrA to PtrB, then we don't need to compare PtrC to both of them in order to tell whether C may be adjacent to A or B.
So that's what we do. When scanning a basic block, we maintain a list of chains, where we know the offset from every element in the chain to the first element in the chain. Each instruction gets compared only to the leaders of all the chains.
In the worst case, this is still O(n^2), because all chains might be of length 1. To prevent compile time blowup, we only consider the 64 most recently used chains. Thus we do no more comparisons than before, but we have the potential to make much longer chains.
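
A toy sketch of the chain idea under stated assumptions (plain integers stand in for pointers, and a distance window stands in for "the pointer difference is a known constant"; this is not the pass itself): each new access is compared against chain leaders only, and only the 64 most recent chains are kept.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <deque>
#include <vector>

struct Access { int64_t Addr; };  // stand-in for a load/store pointer
struct Chain {
  int64_t LeaderAddr;             // first element of the chain
  std::vector<int64_t> Offsets;   // every member's offset to the leader
};

constexpr std::size_t MaxChains = 64; // bounds the O(n^2) worst case

void addAccess(std::deque<Chain> &Chains, Access A) {
  for (Chain &C : Chains) {
    int64_t Off = A.Addr - C.LeaderAddr;
    // Knowing A's offset to the leader gives its offset to every member,
    // so one comparison per chain replaces one per prior access.
    if (std::llabs(Off) < 4096) { // stand-in for "offset is constant"
      C.Offsets.push_back(Off);
      return;
    }
  }
  Chains.push_front({A.Addr, {0}}); // new chain at the front (a fuller
                                    // sketch would also re-front matches)
  if (Chains.size() > MaxChains)
    Chains.pop_back();              // evict the oldest chain
}
```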
This rewrite affects many tests. The changes to tests fall into two categories.
1. The old code had what appears to be a bug when deciding whether a misaligned vectorized load is fast. Suppose TTI reports that load <i32 x 4> align 4 has relative speed 1, and suppose that load i32 align 4 has relative speed 32.
The intent of the code seems to be that we prefer the scalar load, because it's faster. But the old code would choose the vectorized load. accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not even call into TTI to get the relative speed), because the scalar load is aligned.
After this patch, we will prefer the scalar load if it's faster.
2. This patch changes the logic for how we vectorize. Usually this results in vectorizing more.
Explanation of changes to tests:
- AMDGPU/adjust-alloca-alignment.ll: #1
- AMDGPU/flat_atomic.ll: #2, we vectorize more.
- AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this. Before, we'd vectorize in case 1 and not case 2. Now we vectorize in case 2 and not case 1. So we just move the call.
- AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
- AMDGPU/insertion-point.ll: #2, we vectorize more
- AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
- AMDGPU/multiple_tails.ll: #1
- AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
- AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
- NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
- NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
- X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
- X86/subchain-interleaved.ll: #2, we vectorize more
- X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
- X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical. It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
- X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll
Differential Revision: https://reviews.llvm.org/D149893

Revision tags: llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7

# 10cef708 | 01-Dec-2022 | Nicolai Hähnle <nicolai.haehnle@amd.com>
AMDGPU: Clean up LDS-related occupancy calculations
Occupancy is expressed as waves per SIMD. This means that we need to take into account the number of SIMDs per "CU" or, to be more precise, the number of SIMDs over which a workgroup may be distributed.
getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs into account at all.
At the same time, we need to take into account that WGP mode offers access to a larger total amount of LDS, since this can affect how non-power-of-two LDS allocations are rounded. To make this work consistently, we distinguish between (available) local memory size and addressable local memory size (which is always limited by 64kB on gfx10+, even with WGP mode).
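
A toy calculation of the distinction (parameter values are illustrative placeholders, not authoritative per-target numbers): occupancy is waves per SIMD, so an LDS-derived workgroups-per-CU bound has to be spread over the CU's SIMDs rather than reported directly.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model: LDS-limited occupancy in waves per SIMD.
unsigned ldsLimitedOccupancy(uint64_t LocalMemBytes,   // LDS available
                             uint64_t LDSPerWorkgroup, // LDS the WG allocates
                             unsigned WavesPerWorkgroup,
                             unsigned SIMDsPerCU,      // e.g. 4
                             unsigned MaxWavesPerSIMD) // e.g. 10
{
  if (LDSPerWorkgroup == 0)
    return MaxWavesPerSIMD;                 // LDS imposes no limit
  uint64_t WGsPerCU = LocalMemBytes / LDSPerWorkgroup;
  // Waves per CU must be divided over the CU's SIMDs; skipping this
  // division is exactly the kind of error described above.
  uint64_t WavesPerSIMD = (WGsPerCU * WavesPerWorkgroup) / SIMDsPerCU;
  return static_cast<unsigned>(
      std::min<uint64_t>(WavesPerSIMD, MaxWavesPerSIMD));
}
```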
This change results in a massive amount of test churn. A lot of it is caused by the fact that the default work group size is 1024, which means that (due to rounding effects) the default occupancy on older hardware is 8 instead of 10, which affects scheduling via register pressure estimates. I've adjusted most tests by just running the UTC tools, but in some cases I manually changed the work group size to 32 or 64 to make sure that work group size chunkiness has no effect.
Differential Revision: https://reviews.llvm.org/D139468

# 5ecd3632 | 05-Dec-2022 | Jonas Paulsson <paulsson@linux.vnet.ibm.com>
Reapply "[CodeGen] Add new pass for late cleanup of redundant definitions."
This reverts commit 122efef8ee9be57055d204d52c38700fe933c033.
- Patch fixed to not reuse definitions from predecessors in EH landing pads.
- Late review suggestions (by MaskRay) have been addressed.
- M68k/pipeline.ll test updated.
- Init captures added in processBlock() to avoid capturing structured bindings.
- RISCV has this disabled for now.
Original commit message:
A new pass MachineLateInstrsCleanup is added to be run after PEI.
This is a simple pass that removes redundant and identical instructions whenever found by scanning the MF once while keeping track of register definitions in a map. These instructions are typically immediate loads resulting from rematerialization, and address loads emitted by the target in eliminateFrameIndex().
This is enabled by default, but a target could easily disable it by means of 'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions as it is done by looking at phys-regs, but still quite effective. It would be desirable to improve other parts of CodeGen and avoid these redundant instructions in the first place, but there are no ideas for this yet.
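
A toy sketch of the single-scan idea (illustrative types; the real pass works on MachineInstrs and physical registers): remember the last value defined into each register and drop an instruction that redefines the same register with the same value.

```cpp
#include <map>
#include <string>
#include <vector>

struct Inst { unsigned DefReg; std::string Value; }; // e.g. "li r5, 16"

// One forward scan: drop identical redefinitions that are still live.
std::vector<Inst> cleanupBlock(const std::vector<Inst> &Block) {
  std::map<unsigned, std::string> LastDef; // reg -> value it currently holds
  std::vector<Inst> Out;
  for (const Inst &I : Block) {
    auto It = LastDef.find(I.DefReg);
    if (It != LastDef.end() && It->second == I.Value)
      continue; // redundant: the register already holds this exact value
    LastDef[I.DefReg] = I.Value; // the real pass also invalidates entries
                                 // whose inputs get clobbered, and entries
                                 // killed by calls or regmask operands
    Out.push_back(I);
  }
  return Out;
}
```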
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb

# 122efef8 | 04-Dec-2022 | Jonas Paulsson <paulsson@linux.vnet.ibm.com>
Revert "Reapply "[CodeGen] Add new pass for late cleanup of redundant definitions.""
This reverts commit 17db0de330f943833296ae72e26fa988bba39cb3.
Some more bots got broken - need to investigate.

# 17db0de3 | 01-Dec-2022 | Jonas Paulsson <paulsson@linux.vnet.ibm.com>
Reapply "[CodeGen] Add new pass for late cleanup of redundant definitions."
Init captures added in processBlock() to avoid capturing structured bindings, which caused the build problems (with clang).
RISCV has this disabled for now until problems relating to post RA pseudo expansions are resolved.

# 8ef46326 | 01-Dec-2022 | Jonas Paulsson <paulsson@linux.vnet.ibm.com>
Revert "[CodeGen] Add new pass for late cleanup of redundant definitions."
Temporarily revert and fix buildbot failure.
This reverts commit 6d12599fd4134c1da63198c74a25490d28c733f6.

Revision tags: llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1

# 6d12599f | 06-Apr-2022 | Jonas Paulsson <paulsson@linux.vnet.ibm.com>
[CodeGen] Add new pass for late cleanup of redundant definitions.
A new pass MachineLateInstrsCleanup is added to be run after PEI.
This is a simple pass that removes redundant and identical instructions whenever found by scanning the MF once while keeping track of register definitions in a map. These instructions are typically immediate loads resulting from rematerialization, and address loads emitted by the target in eliminateFrameIndex().
This is enabled by default, but a target could easily disable it by means of 'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions as it is done by looking at phys-regs, but still quite effective. It would be desirable to improve other parts of CodeGen and avoid these redundant instructions in the first place, but there are no ideas for this yet.
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb

# b32a5666 | 27-Oct-2022 | Brendon Cahoon <brendon.cahoon@amd.com>
[AMDGPU] Unify uniform return and divergent unreachable blocks
This patch fixes a "failed to annotate CFG" error in SIAnnotateControlFlow. The problem occurs when there are divergent and uniform unreachable/return blocks in the same region. In this case, AMDGPUUnifyDivergentExitNodes does not create a unified block so the region contains multiple exits.
StructurizeCFG does not work properly when there are multiple exits, so the necessary CFG transformations do not occur along divergent control flow. Subsequently, SIAnnotateControlFlow processes the path to the divergent exit block, but may only partially process blocks along a uniform control flow path to another exit block.
This patch fixes the bug by creating a single exit block when there is a divergent exit block in the function.
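
A toy sketch of the unification step (illustrative names, not the AMDGPU pass itself): every edge leaving the region is redirected to one fresh exit block, so StructurizeCFG sees a single-exit region.

```cpp
#include <set>
#include <string>
#include <vector>

struct ToyBlock { std::string Name; std::vector<std::string> Succs; };

void unifyExits(std::vector<ToyBlock> &Blocks,
                const std::set<std::string> &InRegion) {
  const std::string Unified = "UnifiedExit";
  for (ToyBlock &B : Blocks) {
    if (!InRegion.count(B.Name))
      continue;
    for (std::string &S : B.Succs)
      if (!InRegion.count(S)) // edge leaves the region
        S = Unified;          // reroute it to the single exit
  }
  // The region now has exactly one exit; the real transform also inserts
  // PHI nodes in the unified block for values that are live out.
  Blocks.push_back({Unified, {}});
}
```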
Differential revision: https://reviews.llvm.org/D136892

# a38db7bf | 04-Nov-2022 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Fix test failure

# 09d38dd7 | 25-Oct-2022 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Fix assert when trying to overextend liverange
This was trying to add segments beyond the new and use, so skip additional segments.
This would hit (S < E && "Cannot create empty or backwards segment").