6012fed6 | 30-Aug-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Fix sqrt fast math flags spreading to fdiv fast math flags
This was working around the lack of operator| on FastMathFlags. We have that now, which revealed the bug.
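A minimal sketch of the operator this refers to (the helper name is illustrative); the bug above came from the sqrt's flags bleeding into the fdiv's, and combining the two sets explicitly avoids that:

```cpp
#include "llvm/IR/FMF.h"
using namespace llvm;

// Union of two flag sets: every fast-math flag set on either operation.
FastMathFlags unionOf(FastMathFlags DivFMF, FastMathFlags SqrtFMF) {
  return DivFMF | SqrtFMF;
}
```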

a738bdf3 | 16-Aug-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Permit more rsq formation in AMDGPUCodeGenPrepare
We were basing the decision to defer the fast case to codegen on the fdiv itself, rather than looking for a foldable sqrt input.
https://reviews.llvm.org/D158127

b7503ae8 | 11-Aug-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Clear BreakPhiNodesCache in-between functions
Otherwise stale pointers pollute the cache, and when a dead PHI's memory is reused for another PHI, we can get a false positive hit in the cache.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157711

Revision tags: llvmorg-17.0.0-rc2

62ea799e | 03-Aug-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Break Large PHIs: Take whole PHI chains into account
The previous heuristics had a big flaw: they only looked at a single PHI at a time and didn't take the whole "chain" into account. The concept of a "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector. We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.
The solution I found is to consider the whole "PHI chain" when looking at a PHI. That is, we recursively look through the PHI's incoming values and users for other PHIs, then make a decision about the chain as a whole.
The current threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value. In simple terms, two-thirds (rounded up) of the PHIs should be breakable.
This seems to work well. A lower threshold such as 50% is too aggressive because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those cases often causes performance issues.
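A minimal sketch of that threshold in integer arithmetic (names are illustrative, not the pass's actual API):

```cpp
#include <cstdio>

// ceil(ChainSize * 2/3) without floating point: (2n + 2) / 3.
static bool shouldBreakChain(unsigned ChainSize, unsigned NumInteresting) {
  const unsigned Threshold = (2 * ChainSize + 2) / 3;
  return NumInteresting >= Threshold;
}

int main() {
  // A chain of 9 PHIs needs at least ceil(9 * 2/3) = 6 interesting PHIs.
  std::printf("%d\n", shouldBreakChain(9, 6)); // 1
  std::printf("%d\n", shouldBreakChain(9, 5)); // 0
}
```
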
Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156414

Revision tags: llvmorg-17.0.0-rc1, llvmorg-18-init

03612b2c | 21-Jul-2023 | Kazu Hirata <kazu@google.com>
[AMDGPU] Fix an unused variable warning
This patch fixes:
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1006:9: error: unused variable 'Ty' [-Werror,-Wunused-variable]

6398b687 | 21-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Fix variables only used in asserts

8406c356 | 16-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Implement new 2ulp fdiv lowering
Extends the new frexp-scaled reciprocal to the general case. The reciprocal case is just the same thing with frexp of 1 constant folded. Could probably clean up the code to rely on that constant folding.
Improves results for the IEEE path for the default OpenCL division. We used to only emit the fdiv.fast intrinsic, with a 2.5 ulp accuracy threshold and DAZ, which uses explicit range checks. This gives us a better fast option with the default IEEE behavior.
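A hedged host-side sketch of the frexp scaling idea, written against libm; on the GPU the reciprocal step would be v_rcp_f32, and the function name here is illustrative:

```cpp
#include <cmath>
#include <cstdio>

// Scale both operands into [0.5, 1) with frexp so the reciprocal never
// sees a denormal input, then reapply the exponent difference with ldexp.
static float frexp_scaled_fdiv(float A, float B) {
  int EA, EB;
  float MA = std::frexp(A, &EA);  // A = MA * 2^EA, |MA| in [0.5, 1)
  float MB = std::frexp(B, &EB);  // B = MB * 2^EB
  float Q = MA * (1.0f / MB);     // the division stands in for v_rcp_f32
  return std::ldexp(Q, EA - EB);  // undo the scaling
}

int main() {
  std::printf("%g\n", frexp_scaled_fdiv(6.0f, 3.0f)); // 2
}
```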

6699c370 | 19-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling
NFC-ish. Does trigger some reordering of the fdiv scalarization. Also skips scalarizing in more cases where nothing was going to happen. We can still scalarize in some no-op edge cases.
https://reviews.llvm.org/D155740

8287f3af | 03-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Overhaul and improve rcp and rsq f32 formation
The highlight change is a new denormal-safe 1ulp lowering which uses rcp after using frexp to perform input scaling. This saves 2 instructions compared to other implementations, which performed an explicit denormal range change. This improves the OpenCL default and requires a flag for HIP. I don't believe there's any flag wired up for OpenMP to emit the necessary fpmath metadata.
This provides several improvements and changes that were hard to separate without regressing one case or another. Disturbingly, the OpenCL conformance test seems to have the reciprocal test commented out. I locally hacked it back in to test this.
Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like the rcp case, we could do this in codegen if !fpmath were preserved (although we would lose some computeKnownFPClass tricks). Start requiring contract flags to form rsq. The rsq fusion actually improves the result from ~2ulp to ~1ulp. We have some older fusion in codegen which only keys off unsafe math and should be refined.
Expand rsq patterns by checking for denormal inputs and pre/post multiplying like the current library code does. We also take advantage of computeKnownFPClass to avoid the scaling when we can statically prove the input cannot be a denormal. We could do the same for the rcp case, but unlike rsq, a large input can underflow to a denormal; we would need additional upper-bound exponent checks on the input to do the same for rcp.
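A hedged sketch of that pre/post multiply, with libm stand-ins (on the GPU the reciprocal square root would be v_rsq_f32; the even exponent shift of 32 mirrors what library code typically uses, but the exact constant is an assumption here):

```cpp
#include <cmath>
#include <cstdio>

// If the input may be denormal, shift it up by an even exponent so the
// hardware rsq sees a normal value, then undo half the shift afterwards:
// rsq(x * 2^32) == rsq(x) * 2^-16.
static float scaled_rsq(float X, bool MayBeDenormal) {
  if (MayBeDenormal)
    return std::ldexp(1.0f / std::sqrt(std::ldexp(X, 32)), 16);
  return 1.0f / std::sqrt(X); // computeKnownFPClass proved X is normal
}

int main() {
  std::printf("%g\n", scaled_rsq(0x1p-140f, true)); // 2^70
}
```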
This rsq handling also now covers the negated case: we introduce rsq with an fneg. In the case where the fneg doesn't fold into its user, it's a neutral change, but it provides an improvement if it is foldable as a source modifier.
Also starts respecting the arcp attribute properly, and more strictly interprets afn. We were previously interpreting afn as implying you could do the reciprocal expansion of an fdiv. The codegen handling of these also needs to be revisited.
This also effectively introduces the optimization combineRepeatedFPDivisors enables, just done in the IR instead (and only for f32).
This is almost across the board better. The one minor regression is the gfx6/buggy-frexp case, where for multiple reciprocals we could previously reuse rematerialized constants per instance (it's neutral for a single rcp).
The fdiv.fast and sqrt handling need to be revisited next.
https://reviews.llvm.org/D155593

d33ab054 | 19-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Add flag to disable fdiv processing in IR pass
We kind of have to have multiple implementations of fdiv split between the two selectors, with some pre-processing. Add yet another test to check for consistent interpretation of flag combinations. We have quite a bit of test redundancy here already, but there are so many possible interesting permutations that it's unwieldy to cover every detail in any one of them. We have a number of overlapping fdiv tests, but it's hard to follow everything going on as it is.

e5296c52 | 13-Jul-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHIs
The previous heuristic rejected a PHI if one of its users was an unbreakable PHI, no matter what the other users were.
This worked well in most cases, but there's one case in rocRAND where it doesn't. In that case, a PHI node has two PHI users, one breakable and one not. When that PHI node isn't broken, performance falls by 35%.
Relaxing the restriction to require that half of the PHI node's users be breakable fixes the issue, and seems like a sensible change.
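A minimal sketch of the relaxed test as described (illustrative names, not the pass's API):

```cpp
#include <cstdio>

// Accept a PHI when at least half of its PHI users are breakable,
// instead of rejecting it as soon as one unbreakable user is seen.
static bool enoughBreakableUsers(unsigned NumPHIUsers, unsigned NumBreakable) {
  return 2 * NumBreakable >= NumPHIUsers;
}

int main() {
  std::printf("%d\n", enoughBreakableUsers(2, 1)); // 1: the rocRAND case
  std::printf("%d\n", enoughBreakableUsers(4, 1)); // 0: still rejected
}
```
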
Solves SWDEV-409648, SWDEV-398393
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D155184

fbe4ff81 | 14-Jun-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the dynamic mode, since it just ignores the mode. fdiv lowering is still somewhat broken because it involves a mode switch and we need to query the original mode.

64d32545 | 27-Jun-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Delete custom combine on class intrinsic
This is no longer necessary, as class-with-constant will always be transformed to the generic class intrinsic.
https://reviews.llvm.org/D153901

9c82dc6a | 01-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Always use v_rcp_f16 and v_rsq_f16
These inherited the fast-math checks from f32, but the manual suggests they should be accurate enough for unconditional use. The definition of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've been a bit nervous about changing this, as the OpenCL conformance test does not cover half. Brute force produces identical values compared to a reference host implementation for all values.
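A hedged sketch of such a brute-force check over all 2^16 half bit patterns. _Float16 is a Clang/GCC extension, and device_rcp_f16 is a stand-in (stubbed here with a host computation so the harness runs) for reading back the hardware result; a real run would also use a known correctly rounded reference:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Stand-in for fetching the v_rcp_f16 result from the device.
static _Float16 device_rcp_f16(_Float16 X) {
  return static_cast<_Float16>(1.0f / static_cast<float>(X));
}

int main() {
  long Mismatches = 0;
  for (uint32_t Bits = 0; Bits <= 0xFFFF; ++Bits) {
    uint16_t B = Bits;
    _Float16 X;
    std::memcpy(&X, &B, sizeof X);
    // Reference: divide in float, round once to half.
    _Float16 Ref = static_cast<_Float16>(1.0f / static_cast<float>(X));
    _Float16 Got = device_rcp_f16(X);
    uint16_t RB, GB;
    std::memcpy(&RB, &Ref, sizeof RB);
    std::memcpy(&GB, &Got, sizeof GB);
    Mismatches += RB != GB; // NaNs compared bitwise on purpose
  }
  std::printf("%ld mismatches\n", Mismatches);
}
```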

Revision tags: llvmorg-16.0.6, llvmorg-16.0.5

a22ef958 | 24-May-2023 | Anshil Gandhi <gandhi21299@gmail.com>
[AMDGPUCodegenPrepare] Add NewPM Support
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D151241

fa87dd52 | 22-May-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Handle multiple occurrences of an incoming value in break large PHIs
We naively broke all incoming values, assuming they'd be unique. However, it's not illegal to have multiple occurrences of, e.g., `[BB0, V0]` in a PHI node. What's illegal is having the same basic block multiple times but with different values, and that's exactly what the transform caused. This broke some rare applications where the pattern arose.
Now we cache the `BasicBlock, Value` pairs we're breaking so we can reuse the values and preserve this invariant.
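A hedged sketch of such a cache using LLVM ADTs; the helper name and callback are illustrative, not the pass's actual code:

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLFunctionalExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Value.h"
using namespace llvm;

// Key the broken-up replacement on the (incoming block, incoming value)
// pair, so repeated occurrences of [BB0, V0] map to one value and the
// "same block implies same value" invariant is preserved.
static Value *getOrCreateBrokenValue(
    DenseMap<std::pair<BasicBlock *, Value *>, Value *> &Cache,
    BasicBlock *Pred, Value *Incoming,
    function_ref<Value *()> CreateBrokenValue) {
  Value *&Slot = Cache[{Pred, Incoming}];
  if (!Slot)
    Slot = CreateBrokenValue(); // only materialize once per pair
  return Slot;
}
```
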
Solves SWDEV-399460
Reviewed By: #amdgpu, rovka
Differential Revision: https://reviews.llvm.org/D151069

Revision tags: llvmorg-16.0.4

0d0ed9a3 | 05-May-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Pattern match fract instructions in AMDGPUCodeGenPrepare
This will allow eliminating the intrinsic uses in the device libraries, which will remove a subtarget dependency on the f16 version of the intrinsic.
We previously had some wrong patterns for this under unsafe math, which I've removed.
Do it in IR partially to take advantage of the much better isKnownNeverNaN handling, and partially out of laziness to avoid repeating this in the DAG and GlobalISel paths. Plus, I think this should be done much earlier. Ideally this would be in InstCombine, but you can't introduce target intrinsics from a generic-instruction-rooted pattern.
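A hedged scalar-C sketch of the kind of pattern involved: OpenCL-style fract is x - floor(x) clamped to the largest float below 1.0, so that inputs like tiny negatives never return exactly 1.0. The exact IR shape the pass matches isn't shown in this log, so treat this as illustrative:

```cpp
#include <cmath>
#include <cstdio>

// fract(x) = min(x - floor(x), 0x1.fffffep-1f); 0x1.fffffep-1f is the
// largest float strictly below 1.0, so the result never reaches 1.0.
static float fract_like(float X) {
  return std::fmin(X - std::floor(X), 0x1.fffffep-1f);
}

int main() {
  std::printf("%g\n", fract_like(2.75f));  // 0.75
  std::printf("%g\n", fract_like(-0.25f)); // 0.75
}
```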

52a2d07b | 10-May-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Improve PHI-breaking heuristics in CGP
D147786 made the transform more conservative by adding heuristics, which was a good idea. However, the transform got a bit too conservative at times.
This caused a surprise in some rocRAND benchmarks, because D143731 greatly helped a few of them. For instance, a few xorwow-uniform tests saw a +30% boost in performance after that pass, which was lost when D147786 landed.
This patch is an attempt at reaching a middle ground that makes the pass a bit more permissive. It continues in the same spirit as D147786 but makes the following changes:
- PHI users of a PHI node are now recursively checked. When loops are encountered, we consider the PHIs non-breakable. (Considering them breakable had a very negative effect in one app I tested.)
- `shufflevector` is now considered interesting, given that it satisfies a few trivial checks.
Reviewed By: arsenm, #amdgpu, jmmartinez
Differential Revision: https://reviews.llvm.org/D150266

Revision tags: llvmorg-16.0.3

6a0d0711 | 29-Apr-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Don't try to create pointer bitcasts in load widening

Revision tags: llvmorg-16.0.2

b3b3cb2d | 07-Apr-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Less aggressively break large PHIs
In some cases, breaking large PHIs can very negatively affect performance (3x more instructions observed in a particular test case).
This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases, e.g. avoid breaking PHIs if it causes back-and-forth between vector and scalar form for no good reason.
Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147786

Revision tags: llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3

d8925210 | 13-Feb-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Break-up large PHIs for DAGISel
DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen, because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.
This scalarization/PHI "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users. It's also only enabled for DAGISel, as GlobalISel handles PHIs much better (since it works on the whole function).
This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.
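A hedged sketch of the scalarization half against the LLVM C++ API; this illustrates the shape of the transform, not the pass's actual implementation (which also handles subvector splitting and profitability):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Replace one <N x T> PHI with N scalar PHIs. Incoming elements are
// extracted in each predecessor; the vector is rebuilt once, at the
// first insertion point after the PHIs, and replaces the original.
static void scalarizeVectorPHI(PHINode &PN) {
  auto *VT = cast<FixedVectorType>(PN.getType());
  IRBuilder<> PHIBuilder(&PN); // scalar PHIs go where the old PHI was
  IRBuilder<> RebuildBuilder(PN.getParent(),
                             PN.getParent()->getFirstInsertionPt());
  Value *Vec = PoisonValue::get(VT);
  for (unsigned E = 0, NE = VT->getNumElements(); E != NE; ++E) {
    PHINode *Elt =
        PHIBuilder.CreatePHI(VT->getElementType(), PN.getNumIncomingValues());
    for (unsigned I = 0, NI = PN.getNumIncomingValues(); I != NI; ++I) {
      BasicBlock *Pred = PN.getIncomingBlock(I);
      IRBuilder<> PredBuilder(Pred->getTerminator());
      Elt->addIncoming(
          PredBuilder.CreateExtractElement(PN.getIncomingValue(I), E), Pred);
    }
    Vec = RebuildBuilder.CreateInsertElement(Vec, Elt, E);
  }
  PN.replaceAllUsesWith(Vec);
  PN.eraseFromParent();
}
```
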
Fixes SWDEV-321581
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D143731

dbebebf6 | 06-Mar-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Use UniformityAnalysis in CodeGenPrepare
A little extra change was needed in UA because it didn't consider InvokeInst, which made call-constexpr.ll assert.
Reviewed By: sameerds, arsenm
Differential Revision: https://reviews.llvm.org/D145358

dcb83484 | 23-Feb-2023 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies future changes to make some of it depend on the subtarget.
Differential Revision: https://reviews.llvm.org/D144650

64dad4ba | 14-Feb-2023 | Kazu Hirata <kazu@google.com>
Use llvm::bit_cast (NFC)
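For reference, llvm::bit_cast (from llvm/ADT/bit.h) reinterprets an object's bits as another trivially copyable type of the same size; a small usage sketch:

```cpp
#include "llvm/ADT/bit.h"
#include <cstdint>
#include <cstdio>

int main() {
  // Read the bit pattern of a float without the undefined behavior of a
  // pointer cast; equivalent to C++20 std::bit_cast.
  uint32_t Bits = llvm::bit_cast<uint32_t>(1.0f);
  std::printf("0x%08x\n", Bits); // 0x3f800000
}
```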

Revision tags: llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7

6443c0ee | 12-Dec-2022 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Stop using make_pair and make_tuple. NFC.
C++17 allows us to call the constructors of pair and tuple directly instead of the helper functions make_pair and make_tuple.
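That is, class template argument deduction makes the helpers redundant; a quick illustration:

```cpp
#include <tuple>
#include <utility>

// Before: helper functions deduce the element types.
auto P1 = std::make_pair(1, 2.0f);       // std::pair<int, float>
auto T1 = std::make_tuple(1, 2.0f, 'c'); // std::tuple<int, float, char>

// After (C++17): the constructors deduce them directly.
std::pair P2{1, 2.0f};                   // std::pair<int, float>
std::tuple T2{1, 2.0f, 'c'};             // std::tuple<int, float, char>
```
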
Differential Revision: https://reviews.llvm.org/D139828