History log of /llvm-project/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (Results 26 – 50 of 136)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 6012fed6 30-Aug-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Fix sqrt fast math flags spreading to fdiv fast math flags

This was working around the lack of operator| on FastMathFlags. We
have that now which revealed the bug.


# a738bdf3 16-Aug-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Permit more rsq formation in AMDGPUCodeGenPrepare

We were basing the defer the fast case to codegen based on the fdiv
itself, and not looking for a foldable sqrt input.

https://reviews.llvm

AMDGPU: Permit more rsq formation in AMDGPUCodeGenPrepare

We were basing the defer the fast case to codegen based on the fdiv
itself, and not looking for a foldable sqrt input.

https://reviews.llvm.org/D158127

show more ...


# b7503ae8 11-Aug-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Clear BreakPhiNodesCache in-between functions

Otherwise stale pointers pollute the cache and
when a dead PHI's memory is reused for another PHI, we can get a false positive hit in the cache

[AMDGPU] Clear BreakPhiNodesCache in-between functions

Otherwise stale pointers pollute the cache and
when a dead PHI's memory is reused for another PHI, we can get a false positive hit in the cache.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D157711

show more ...


Revision tags: llvmorg-17.0.0-rc2
# 62ea799e 03-Aug-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Break Large PHIs: Take whole PHI chains into account

Previous heuristics had a big flaw: they only looked at single PHI at a time, and didn't take into account the whole "chain".
The concep

[AMDGPU] Break Large PHIs: Take whole PHI chains into account

Previous heuristics had a big flaw: they only looked at single PHI at a time, and didn't take into account the whole "chain".
The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector.
We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.

The solution I found is to consider the whole "PHI chain" when looking at PHI.
That is, we recursively look at the PHI's incoming value & users for other PHIs, then make a decision about the chain as a whole.

The currrent threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value.
In simple terms, two-thirds (rounded up) of the PHIs should be breakable.

This seems to work well. A lower threshold such as 50% is too aggressive because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those case often causes performance issue.

Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156414

show more ...


Revision tags: llvmorg-17.0.0-rc1, llvmorg-18-init
# 03612b2c 21-Jul-2023 Kazu Hirata <kazu@google.com>

[AMDGPU] Fix an unused variable warning

This patch fixes:

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1006:9: error:
unused variable 'Ty' [-Werror,-Wunused-variable]


# 6398b687 21-Jul-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Fix variables only used in asserts


# 8406c356 16-Jul-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Implement new 2ulp fdiv lowering

Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably cl

AMDGPU: Implement new 2ulp fdiv lowering

Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably clean up the code to rely on that constant
folding.

Improves results for the IEEE path for the default OpenCL division. We
used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy
threshold with DAZ, which uses explicit range checks. This gives us a
better fast option with the default IEEE behavior.

show more ...


# 6699c370 19-Jul-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling

NFC-ish. Does trigger some reordering of the fdiv scalarization. Also
skips scalarizing in more cases where nothing was going to happen. We
can st

AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling

NFC-ish. Does trigger some reordering of the fdiv scalarization. Also
skips scalarizing in more cases where nothing was going to happen. We
can still scalarize in some no-op edge cases.

https://reviews.llvm.org/D155740

show more ...


# 8287f3af 03-Jul-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Overhaul and improve rcp and rsq f32 formation

The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions

AMDGPU: Overhaul and improve rcp and rsq f32 formation

The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions compared to other implementations which performed an
explicit denormal range change. This improves the OpenCL default, and
requires a flag for HIP. I don't believe there's any flag wired up for
OpenMP to emit the necessary fpmath metadata.

This provides several improvements and changes that were hard to
separate without regressing one case or another. Disturbingly the
OpenCL conformance test seems to have the reciprocal test commented
out. I locally hacked it back in to test this.

Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like
the rcp case, we could do this in codegen if !fpmath were preserved
(although we would lose some computeKnownFPClass tricks). Start
requiring contract flags to form rsq. The rsq fusion actually improves
the result from ~2ulp to ~1ulp. We have some older fusion in codegen
which only keys off unsafe math which should be refined.

Expand rsq patterns by checking for denormal inputs and pre/post
multiplying like the current library code does. We also take advantage
of computeKnownFPClass to avoid the scaling when we can statically
prove the input cannot be a denormal. We could do the same for the rcp
case, but unlike rsq a large input can underflow to denormal. We need
additional upper bound exponent checks on the input in order to do the
same for rcp.

This rsq handling also now starts handling the negated case. We
introduce rsq with an fneg. In the case the fneg doesn't fold into its
user, it's a neutral change but provides improvement if it is foldable
as a source modifier.

Also starts respecting the arcp attribute properly, and more strictly
interprets afn. We were previously interpreting afn as implying you
could do the reciprocal expansion of an fdiv. The codegen handling of
these also needs to be revisited.

This also effectively introduces the optimization
combineRepeatedFPDivisors enables, just done in the IR instead (and
only for f32).

This is almost across the board better. The one minor regression is
for gfx6/buggy frexp case where for multiple reciprocals, we could
previously reuse rematerialized constants per instance (it's neutral
for a single rcp).

The fdiv.fast and sqrt handling need to be revisited next.

https://reviews.llvm.org/D155593

show more ...


# d33ab054 19-Jul-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Add flag to disable fdiv processing in IR pass

We kind of have to have multiple implementations of fdiv split between
the two selectors with some pre-processing. Add yet another test to
chec

AMDGPU: Add flag to disable fdiv processing in IR pass

We kind of have to have multiple implementations of fdiv split between
the two selectors with some pre-processing. Add yet another test to
check for consistency of interpretation of flag combinations. We have
quite a bit of test redundancy here already, but there are so many
possible interesting permutations it's unwieldy to cover every detail
in any one of them. We have a number of overlapping fdiv tests but
it's hard to follow everything going on as it is.

show more ...


# e5296c52 13-Jul-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHis

The previous heuristic rejected a PHI if one of its user was an unbreakable PHI, no matter what the other users were.

This wor

[AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHis

The previous heuristic rejected a PHI if one of its user was an unbreakable PHI, no matter what the other users were.

This worked well in most cases, but there's one case in rocRAND where
it doesn't work. In that case, a PHI node has 2 PHI users where one is
breakable but not the other. When that PHI node isn't broken performance falls by 35%.

Relaxing the restriction to "require that half of the PHI node users are breakable" fixes the issue, and seems like a sensible change.

Solves SWDEV-409648, SWDEV-398393

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155184

show more ...


# fbe4ff81 14-Jun-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Partially fix not respecting dynamic denormal mode

The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
s

AMDGPU: Partially fix not respecting dynamic denormal mode

The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.

show more ...


# 64d32545 27-Jun-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Delete custom combine on class intrinsic

This is no longer necessary as class-with-constant will always be
transformed to the generic class intrinsic.

https://reviews.llvm.org/D153901


# 9c82dc6a 01-Jul-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Always use v_rcp_f16 and v_rsq_f16

These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly r

AMDGPU: Always use v_rcp_f16 and v_rsq_f16

These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.

show more ...


Revision tags: llvmorg-16.0.6, llvmorg-16.0.5
# a22ef958 24-May-2023 Anshil Gandhi <gandhi21299@gmail.com>

[AMDGPUCodegenPrepare] Add NewPM Support

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D151241


# fa87dd52 22-May-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Handle multiple occurences of an incoming value in break large PHIs

We naively broke all incoming values, assuming they'd be unique.
However it's not illegal to have multiple occurences of,

[AMDGPU] Handle multiple occurences of an incoming value in break large PHIs

We naively broke all incoming values, assuming they'd be unique.
However it's not illegal to have multiple occurences of, e.g. `[BB0, V0]`
in a PHI node. What's illegal though is having the same basic block
multiple times but with different values, and it's exactly what the
transform caused. This broke in some rare applications where the pattern
arised.

Now we cache the `BasicBlock, Value` pairs we're breaking so we can reuse the values and preserve this invariant.

Solves SWDEV-399460

Reviewed By: #amdgpu, rovka

Differential Revision: https://reviews.llvm.org/D151069

show more ...


Revision tags: llvmorg-16.0.4
# 0d0ed9a3 05-May-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Pattern match fract instructions in AMDGPUCodeGenPrepare

This will allow eliminating the intrinsic uses in the device
libraries, which will remove a subtarget dependency on the f16
version o

AMDGPU: Pattern match fract instructions in AMDGPUCodeGenPrepare

This will allow eliminating the intrinsic uses in the device
libraries, which will remove a subtarget dependency on the f16
version of the intrinsic.

We previously had some wrong patterns for this under unsafe math
which I've removed.

Do it in IR partially to take advantage of the much better isKnownNeverNaN
handling, and partially out of laziness to avoid repeating this in the DAG
and GlobalISel path. Plus I think this should be done much earlier. Ideally
this would be in InstCombine, but you can't introduce target intrinsics
from a generic instruction rooted pattern.

show more ...


# 52a2d07b 10-May-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Improve PHI-breaking heuristics in CGP

D147786 made the transform more conservative by adding heuristics,
which was a good idea. However, the transform got a bit
too conservative at times.

[AMDGPU] Improve PHI-breaking heuristics in CGP

D147786 made the transform more conservative by adding heuristics,
which was a good idea. However, the transform got a bit
too conservative at times.

This caused a surprise in some rocRAND benchmarks because D143731 greatly helped a few of them.
For instance, a few xorwow-uniform tests saw a +30% boost in performance after that pass, which was lost when D147786 landed.

This patch is an attempt at reaching a middleground that makes
the pass a bit more permissive. It continues in the same spirit as
D147786 but does the following changes:
- PHI users of a PHI node are now recursively checked. When loops are encountered, we consider the PHIs non-breakable. (Considering them breakable had very negative effect in one app I tested)
- `shufflevector` is now considered interesting, given that it satisfies a few trivial checks.

Reviewed By: arsenm, #amdgpu, jmmartinez

Differential Revision: https://reviews.llvm.org/D150266

show more ...


Revision tags: llvmorg-16.0.3
# 6a0d0711 29-Apr-2023 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Don't try to create pointer bitcasts in load widening


Revision tags: llvmorg-16.0.2
# b3b3cb2d 07-Apr-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Less aggressively break large PHIs

In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).

This patch adds some

[AMDGPU] Less aggressively break large PHIs

In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).

This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.

Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D147786

show more ...


Revision tags: llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3
# d8925210 13-Feb-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Break-up large PHIs for DAGISel

DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen.
This is because it introduces a need to have a build

[AMDGPU] Break-up large PHIs for DAGISel

DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen.
This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users.
It's also only enabled for DAGIsel and GlobalISel handles PHIs much better (as it works on the whole function).

This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.

Fixes SWDEV-321581

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D143731

show more ...


# dbebebf6 06-Mar-2023 pvanhout <pierre.vanhoutryve@amd.com>

[AMDGPU] Use UniformityAnalysis in CodeGenPrepare

A little extra change was needed in UA because it didn't consider
InvokeInst and it made call-constexpr.ll assert.

Reviewed By: sameerds, arsenm

D

[AMDGPU] Use UniformityAnalysis in CodeGenPrepare

A little extra change was needed in UA because it didn't consider
InvokeInst and it made call-constexpr.ll assert.

Reviewed By: sameerds, arsenm

Differential Revision: https://reviews.llvm.org/D145358

show more ...


# dcb83484 23-Feb-2023 Jay Foad <jay.foad@amd.com>

[AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.

This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.

[AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.

This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.

Differential Revision: https://reviews.llvm.org/D144650

show more ...


# 64dad4ba 14-Feb-2023 Kazu Hirata <kazu@google.com>

Use llvm::bit_cast (NFC)


Revision tags: llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7
# 6443c0ee 12-Dec-2022 Jay Foad <jay.foad@amd.com>

[AMDGPU] Stop using make_pair and make_tuple. NFC.

C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.

Differential Revision: https://reviews.l

[AMDGPU] Stop using make_pair and make_tuple. NFC.

C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.

Differential Revision: https://reviews.llvm.org/D139828

show more ...


123456