6012fed6 | 30-Aug-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Fix sqrt fast math flags spreading to fdiv fast math flags
This was working around the lack of operator| on FastMathFlags. We have that now, which revealed the bug.
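A minimal sketch of the operator this refers to (the helper name is illustrative); the bug above came from the sqrt's flags bleeding into the fdiv's, and combining the two sets explicitly avoids that:

```cpp
#include "llvm/IR/FMF.h"
using namespace llvm;

// Union of two flag sets: every fast-math flag set on either operation.
FastMathFlags unionOf(FastMathFlags DivFMF, FastMathFlags SqrtFMF) {
  return DivFMF | SqrtFMF;
}
```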

a738bdf3 | 16-Aug-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Permit more rsq formation in AMDGPUCodeGenPrepare
We were basing the decision to defer the fast case to codegen on the fdiv itself, rather than looking for a foldable sqrt input.
https://reviews.llvm.org/D158127

b7503ae8 | 11-Aug-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Clear BreakPhiNodesCache in-between functions
Otherwise stale pointers pollute the cache, and when a dead PHI's memory is reused for another PHI, we can get a false positive hit in the cache.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157711

Revision tags: llvmorg-17.0.0-rc2

62ea799e | 03-Aug-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Break Large PHIs: Take whole PHI chains into account
The previous heuristics had a big flaw: they only looked at a single PHI at a time and didn't take the whole "chain" into account. The concept of a "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector. We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.
The solution I found is to consider the whole "PHI chain" when looking at a PHI. That is, we recursively look through the PHI's incoming values and users for other PHIs, then make a decision about the chain as a whole.
The current threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value. In simple terms, two-thirds (rounded up) of the PHIs should be breakable.
This seems to work well. A lower threshold such as 50% is too aggressive because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those cases often causes performance issues.
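A minimal sketch of that threshold in integer arithmetic (names are illustrative, not the pass's actual API):

```cpp
#include <cstdio>

// ceil(ChainSize * 2/3) without floating point: (2n + 2) / 3.
static bool shouldBreakChain(unsigned ChainSize, unsigned NumInteresting) {
  const unsigned Threshold = (2 * ChainSize + 2) / 3;
  return NumInteresting >= Threshold;
}

int main() {
  // A chain of 9 PHIs needs at least ceil(9 * 2/3) = 6 interesting PHIs.
  std::printf("%d\n", shouldBreakChain(9, 6)); // 1
  std::printf("%d\n", shouldBreakChain(9, 5)); // 0
}
```
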
Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156414

Revision tags: llvmorg-17.0.0-rc1, llvmorg-18-init

03612b2c | 21-Jul-2023 | Kazu Hirata <kazu@google.com>
[AMDGPU] Fix an unused variable warning
This patch fixes:
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1006:9: error: unused variable 'Ty' [-Werror,-Wunused-variable]

6398b687 | 21-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Fix variables only used in asserts

8406c356 | 16-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Implement new 2ulp fdiv lowering
Extends the new frexp-scaled reciprocal to the general case. The reciprocal case is just the same thing with frexp of 1 constant folded. Could probably clean up the code to rely on that constant folding.
Improves results for the IEEE path for the default OpenCL division. We used to only emit the fdiv.fast intrinsic, with a 2.5 ulp accuracy threshold and DAZ, which uses explicit range checks. This gives us a better fast option with the default IEEE behavior.
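A hedged host-side sketch of the frexp scaling idea, written against libm; on the GPU the reciprocal step would be v_rcp_f32, and the function name here is illustrative:

```cpp
#include <cmath>
#include <cstdio>

// Scale both operands into [0.5, 1) with frexp so the reciprocal never
// sees a denormal input, then reapply the exponent difference with ldexp.
static float frexp_scaled_fdiv(float A, float B) {
  int EA, EB;
  float MA = std::frexp(A, &EA);  // A = MA * 2^EA, |MA| in [0.5, 1)
  float MB = std::frexp(B, &EB);  // B = MB * 2^EB
  float Q = MA * (1.0f / MB);     // the division stands in for v_rcp_f32
  return std::ldexp(Q, EA - EB);  // undo the scaling
}

int main() {
  std::printf("%g\n", frexp_scaled_fdiv(6.0f, 3.0f)); // 2
}
```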

6699c370 | 19-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling
NFC-ish. Does trigger some reordering of the fdiv scalarization. Also skips scalarizing in more cases where nothing was going to happen. We can still scalarize in some no-op edge cases.
https://reviews.llvm.org/D155740

8287f3af | 03-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Overhaul and improve rcp and rsq f32 formation
The highlight change is a new denormal-safe 1ulp lowering which uses rcp after using frexp to perform input scaling. This saves 2 instructions compared to other implementations, which performed an explicit denormal range change. This improves the OpenCL default and requires a flag for HIP. I don't believe there's any flag wired up for OpenMP to emit the necessary fpmath metadata.
This provides several improvements and changes that were hard to separate without regressing one case or another. Disturbingly, the OpenCL conformance test seems to have the reciprocal test commented out. I locally hacked it back in to test this.
Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like the rcp case, we could do this in codegen if !fpmath were preserved (although we would lose some computeKnownFPClass tricks). Start requiring contract flags to form rsq. The rsq fusion actually improves the result from ~2ulp to ~1ulp. We have some older fusion in codegen which only keys off unsafe math and should be refined.
Expand rsq patterns by checking for denormal inputs and pre/post multiplying like the current library code does. We also take advantage of computeKnownFPClass to avoid the scaling when we can statically prove the input cannot be a denormal. We could do the same for the rcp case, but unlike rsq, a large input can underflow to a denormal; we would need additional upper-bound exponent checks on the input to do the same for rcp.
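A hedged sketch of that pre/post multiply, with libm stand-ins (on the GPU the reciprocal square root would be v_rsq_f32; the even exponent shift of 32 mirrors what library code typically uses, but the exact constant is an assumption here):

```cpp
#include <cmath>
#include <cstdio>

// If the input may be denormal, shift it up by an even exponent so the
// hardware rsq sees a normal value, then undo half the shift afterwards:
// rsq(x * 2^32) == rsq(x) * 2^-16.
static float scaled_rsq(float X, bool MayBeDenormal) {
  if (MayBeDenormal)
    return std::ldexp(1.0f / std::sqrt(std::ldexp(X, 32)), 16);
  return 1.0f / std::sqrt(X); // computeKnownFPClass proved X is normal
}

int main() {
  std::printf("%g\n", scaled_rsq(0x1p-140f, true)); // 2^70
}
```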
This rsq handling also now covers the negated case: we introduce rsq with an fneg. In the case where the fneg doesn't fold into its user, it's a neutral change, but it provides an improvement if it is foldable as a source modifier.
Also starts respecting the arcp attribute properly, and more strictly interprets afn. We were previously interpreting afn as implying you could do the reciprocal expansion of an fdiv. The codegen handling of these also needs to be revisited.
This also effectively introduces the optimization combineRepeatedFPDivisors enables, just done in the IR instead (and only for f32).
This is almost across the board better. The one minor regression is the gfx6/buggy-frexp case, where for multiple reciprocals we could previously reuse rematerialized constants per instance (it's neutral for a single rcp).
The fdiv.fast and sqrt handling need to be revisited next.
https://reviews.llvm.org/D155593

d33ab054 | 19-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Add flag to disable fdiv processing in IR pass
We kind of have to have multiple implementations of fdiv split between the two selectors, with some pre-processing. Add yet another test to check for consistent interpretation of flag combinations. We have quite a bit of test redundancy here already, but there are so many possible interesting permutations that it's unwieldy to cover every detail in any one of them. We have a number of overlapping fdiv tests, but it's hard to follow everything going on as it is.

e5296c52 | 13-Jul-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHIs
The previous heuristic rejected a PHI if one of its users was an unbreakable PHI, no matter what the other users were.
This worked well in most cases, but there's one case in rocRAND where it doesn't. In that case, a PHI node has two PHI users, one breakable and one not. When that PHI node isn't broken, performance falls by 35%.
Relaxing the restriction to require that half of the PHI node's users be breakable fixes the issue, and seems like a sensible change.
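A minimal sketch of the relaxed test as described (illustrative names, not the pass's API):

```cpp
#include <cstdio>

// Accept a PHI when at least half of its PHI users are breakable,
// instead of rejecting it as soon as one unbreakable user is seen.
static bool enoughBreakableUsers(unsigned NumPHIUsers, unsigned NumBreakable) {
  return 2 * NumBreakable >= NumPHIUsers;
}

int main() {
  std::printf("%d\n", enoughBreakableUsers(2, 1)); // 1: the rocRAND case
  std::printf("%d\n", enoughBreakableUsers(4, 1)); // 0: still rejected
}
```
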
Solves SWDEV-409648, SWDEV-398393
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D155184

fbe4ff81 | 14-Jun-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the dynamic mode, since it just ignores the mode. fdiv lowering is still somewhat broken because it involves a mode switch and we need to query the original mode.

64d32545 | 27-Jun-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Delete custom combine on class intrinsic
This is no longer necessary, as class-with-constant will always be transformed to the generic class intrinsic.
https://reviews.llvm.org/D153901

9c82dc6a | 01-Jul-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Always use v_rcp_f16 and v_rsq_f16
These inherited the fast-math checks from f32, but the manual suggests they should be accurate enough for unconditional use. The definition of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've been a bit nervous about changing this, as the OpenCL conformance test does not cover half. Brute force produces identical values compared to a reference host implementation for all values.
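A hedged sketch of such a brute-force check over all 2^16 half bit patterns. _Float16 is a Clang/GCC extension, and device_rcp_f16 is a stand-in (stubbed here with a host computation so the harness runs) for reading back the hardware result; a real run would also use a known correctly rounded reference:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Stand-in for fetching the v_rcp_f16 result from the device.
static _Float16 device_rcp_f16(_Float16 X) {
  return static_cast<_Float16>(1.0f / static_cast<float>(X));
}

int main() {
  long Mismatches = 0;
  for (uint32_t Bits = 0; Bits <= 0xFFFF; ++Bits) {
    uint16_t B = Bits;
    _Float16 X;
    std::memcpy(&X, &B, sizeof X);
    // Reference: divide in float, round once to half.
    _Float16 Ref = static_cast<_Float16>(1.0f / static_cast<float>(X));
    _Float16 Got = device_rcp_f16(X);
    uint16_t RB, GB;
    std::memcpy(&RB, &Ref, sizeof RB);
    std::memcpy(&GB, &Got, sizeof GB);
    Mismatches += RB != GB; // NaNs compared bitwise on purpose
  }
  std::printf("%ld mismatches\n", Mismatches);
}
```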

Revision tags: llvmorg-16.0.6, llvmorg-16.0.5

a22ef958 | 24-May-2023 | Anshil Gandhi <gandhi21299@gmail.com>
[AMDGPUCodegenPrepare] Add NewPM Support
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D151241

fa87dd52 | 22-May-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Handle multiple occurrences of an incoming value in break large PHIs
We naively broke all incoming values, assuming they'd be unique. However, it's not illegal to have multiple occurrences of, e.g., `[BB0, V0]` in a PHI node. What's illegal is having the same basic block multiple times but with different values, and that's exactly what the transform caused. This broke some rare applications where the pattern arose.
Now we cache the `BasicBlock, Value` pairs we're breaking so we can reuse the values and preserve this invariant.
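A hedged sketch of such a cache using LLVM ADTs; the helper name and callback are illustrative, not the pass's actual code:

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLFunctionalExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Value.h"
using namespace llvm;

// Key the broken-up replacement on the (incoming block, incoming value)
// pair, so repeated occurrences of [BB0, V0] map to one value and the
// "same block implies same value" invariant is preserved.
static Value *getOrCreateBrokenValue(
    DenseMap<std::pair<BasicBlock *, Value *>, Value *> &Cache,
    BasicBlock *Pred, Value *Incoming,
    function_ref<Value *()> CreateBrokenValue) {
  Value *&Slot = Cache[{Pred, Incoming}];
  if (!Slot)
    Slot = CreateBrokenValue(); // only materialize once per pair
  return Slot;
}
```
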
Solves SWDEV-399460
Reviewed By: #amdgpu, rovka
Differential Revision: https://reviews.llvm.org/D151069

Revision tags: llvmorg-16.0.4

0d0ed9a3 | 05-May-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Pattern match fract instructions in AMDGPUCodeGenPrepare
This will allow eliminating the intrinsic uses in the device libraries, which will remove a subtarget dependency on the f16 version of the intrinsic.
We previously had some wrong patterns for this under unsafe math, which I've removed.
Do it in IR partially to take advantage of the much better isKnownNeverNaN handling, and partially out of laziness to avoid repeating this in the DAG and GlobalISel paths. Plus, I think this should be done much earlier. Ideally this would be in InstCombine, but you can't introduce target intrinsics from a generic-instruction-rooted pattern.
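A hedged scalar-C sketch of the kind of pattern involved: OpenCL-style fract is x - floor(x) clamped to the largest float below 1.0, so that inputs like tiny negatives never return exactly 1.0. The exact IR shape the pass matches isn't shown in this log, so treat this as illustrative:

```cpp
#include <cmath>
#include <cstdio>

// fract(x) = min(x - floor(x), 0x1.fffffep-1f); 0x1.fffffep-1f is the
// largest float strictly below 1.0, so the result never reaches 1.0.
static float fract_like(float X) {
  return std::fmin(X - std::floor(X), 0x1.fffffep-1f);
}

int main() {
  std::printf("%g\n", fract_like(2.75f));  // 0.75
  std::printf("%g\n", fract_like(-0.25f)); // 0.75
}
```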

52a2d07b | 10-May-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Improve PHI-breaking heuristics in CGP
D147786 made the transform more conservative by adding heuristics, which was a good idea. However, the transform got a bit too conservative at times.
This caused a surprise in some rocRAND benchmarks, because D143731 greatly helped a few of them. For instance, a few xorwow-uniform tests saw a +30% boost in performance after that pass, which was lost when D147786 landed.
This patch is an attempt at reaching a middle ground that makes the pass a bit more permissive. It continues in the same spirit as D147786 but makes the following changes:
- PHI users of a PHI node are now recursively checked. When loops are encountered, we consider the PHIs non-breakable. (Considering them breakable had a very negative effect in one app I tested.)
- `shufflevector` is now considered interesting, given that it satisfies a few trivial checks.
Reviewed By: arsenm, #amdgpu, jmmartinez
Differential Revision: https://reviews.llvm.org/D150266

Revision tags: llvmorg-16.0.3

6a0d0711 | 29-Apr-2023 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Don't try to create pointer bitcasts in load widening

Revision tags: llvmorg-16.0.2

b3b3cb2d | 07-Apr-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Less aggressively break large PHIs
In some cases, breaking large PHIs can very negatively affect performance (3x more instructions observed in a particular test case).
This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases, e.g. avoid breaking PHIs if it causes back-and-forth between vector and scalar form for no good reason.
Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147786

Revision tags: llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3

d8925210 | 13-Feb-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Break-up large PHIs for DAGISel
DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen, because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.
This scalarization/PHI "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users. It's also only enabled for DAGISel, as GlobalISel handles PHIs much better (since it works on the whole function).
This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.
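A hedged sketch of the scalarization half against the LLVM C++ API; this illustrates the shape of the transform, not the pass's actual implementation (which also handles subvector splitting and profitability):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Replace one <N x T> PHI with N scalar PHIs. Incoming elements are
// extracted in each predecessor; the vector is rebuilt once, at the
// first insertion point after the PHIs, and replaces the original.
static void scalarizeVectorPHI(PHINode &PN) {
  auto *VT = cast<FixedVectorType>(PN.getType());
  IRBuilder<> PHIBuilder(&PN); // scalar PHIs go where the old PHI was
  IRBuilder<> RebuildBuilder(PN.getParent(),
                             PN.getParent()->getFirstInsertionPt());
  Value *Vec = PoisonValue::get(VT);
  for (unsigned E = 0, NE = VT->getNumElements(); E != NE; ++E) {
    PHINode *Elt =
        PHIBuilder.CreatePHI(VT->getElementType(), PN.getNumIncomingValues());
    for (unsigned I = 0, NI = PN.getNumIncomingValues(); I != NI; ++I) {
      BasicBlock *Pred = PN.getIncomingBlock(I);
      IRBuilder<> PredBuilder(Pred->getTerminator());
      Elt->addIncoming(
          PredBuilder.CreateExtractElement(PN.getIncomingValue(I), E), Pred);
    }
    Vec = RebuildBuilder.CreateInsertElement(Vec, Elt, E);
  }
  PN.replaceAllUsesWith(Vec);
  PN.eraseFromParent();
}
```
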
Fixes SWDEV-321581
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D143731

dbebebf6 | 06-Mar-2023 | pvanhout <pierre.vanhoutryve@amd.com>
[AMDGPU] Use UniformityAnalysis in CodeGenPrepare
A little extra change was needed in UA because it didn't consider InvokeInst, which made call-constexpr.ll assert.
Reviewed By: sameerds, arsenm
Differential Revision: https://reviews.llvm.org/D145358

dcb83484 | 23-Feb-2023 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies future changes to make some of it depend on the subtarget.
Differential Revision: https://reviews.llvm.org/D144650

64dad4ba | 14-Feb-2023 | Kazu Hirata <kazu@google.com>
Use llvm::bit_cast (NFC)
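For reference, llvm::bit_cast (from llvm/ADT/bit.h) reinterprets an object's bits as another trivially copyable type of the same size; a small usage sketch:

```cpp
#include "llvm/ADT/bit.h"
#include <cstdint>
#include <cstdio>

int main() {
  // Read the bit pattern of a float without the undefined behavior of a
  // pointer cast; equivalent to C++20 std::bit_cast.
  uint32_t Bits = llvm::bit_cast<uint32_t>(1.0f);
  std::printf("0x%08x\n", Bits); // 0x3f800000
}
```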

Revision tags: llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7

6443c0ee | 12-Dec-2022 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Stop using make_pair and make_tuple. NFC.
C++17 allows us to call the constructors of pair and tuple directly instead of the helper functions make_pair and make_tuple.
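That is, class template argument deduction makes the helpers redundant; a quick illustration:

```cpp
#include <tuple>
#include <utility>

// Before: helper functions deduce the element types.
auto P1 = std::make_pair(1, 2.0f);       // std::pair<int, float>
auto T1 = std::make_tuple(1, 2.0f, 'c'); // std::tuple<int, float, char>

// After (C++17): the constructors deduce them directly.
std::pair P2{1, 2.0f};                   // std::pair<int, float>
std::tuple T2{1, 2.0f, 'c'};             // std::tuple<int, float, char>
```
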
Differential Revision: https://reviews.llvm.org/D139828