History log of /llvm-project/llvm/test/CodeGen/AMDGPU/ds-alignment.ll (Results 1 – 25 of 26)
Revision tags: llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4
# 6548b635 09-Nov-2024 Shilei Tian <i@tianshilei.me>

Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"

This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.


# ca33649a 08-Nov-2024 Shilei Tian <i@tianshilei.me>

Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"

This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both
hip and openmp buildbots.


# e215a1e2 08-Nov-2024 Shilei Tian <i@tianshilei.me>

[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)


Revision tags: llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init
# b1bcb7ca 15-Jul-2024 Matt Arsenault <Matthew.Arsenault@amd.com>

Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)

This reverts commit adaff46d087799072438dd744b038e6fd50a2d78.

Drop the -O3 checks from default-attributes.hip. I don't know why they
are different on some bots but reverting this is far too disruptive.


# adaff46d 15-Jul-2024 dyung <douglas.yung@sony.com>

Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)

This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and
78bc1b64a6dc3fb6191355a5e1b502be8b3668e7.

The test CodeGenHIP/default-attributes.hip is failing on multiple bots
even after the attempted fix including the following:
- https://lab.llvm.org/buildbot/#/builders/3/builds/1473
- https://lab.llvm.org/buildbot/#/builders/65/builds/1380
- https://lab.llvm.org/buildbot/#/builders/161/builds/595
- https://lab.llvm.org/buildbot/#/builders/154/builds/1372
- https://lab.llvm.org/buildbot/#/builders/133/builds/1547
- https://lab.llvm.org/buildbot/#/builders/81/builds/755
- https://lab.llvm.org/buildbot/#/builders/40/builds/570
- https://lab.llvm.org/buildbot/#/builders/13/builds/748
- https://lab.llvm.org/buildbot/#/builders/12/builds/1845
- https://lab.llvm.org/buildbot/#/builders/11/builds/1695
- https://lab.llvm.org/buildbot/#/builders/190/builds/1829
- https://lab.llvm.org/buildbot/#/builders/193/builds/962
- https://lab.llvm.org/buildbot/#/builders/23/builds/991
- https://lab.llvm.org/buildbot/#/builders/144/builds/2256
- https://lab.llvm.org/buildbot/#/builders/46/builds/1614

These bots have been broken for a day, so reverting to get everything
back to green.


# 78bc1b64 14-Jul-2024 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Move attributor into optimization pipeline (#83131)

Removing it from the codegen pipeline induces a lot of test churn
because llc is no longer optimizing out implicit arguments to kernels.

Mostly mechanical, but there are some creative test updates. I preferred
to take the changes as-is in tests where the ABI isn't relevant. In
cases where it's more relevant, or the optimize out logic was too
ingrained in the test, I pre-run the optimization. Some cases manually
add attributes to disable inputs.


Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init
# 9e9907f1 17-Jan-2024 Fangrui Song <i@maskray.me>

[AMDGPU,test] Change llc -march= to -mtriple= (#75982)

Similar to 806761a7629df268c8aed49657aeccffa6bca449.

For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.

Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outright.

This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:

```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
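
The difference between the two flags described above can be modelled in a few lines. This is an illustrative Python sketch, not llc's implementation: the function names and the example host triple `x86_64-apple-darwin` are assumptions made for the demonstration.

```python
def apply_march(default_triple: str, arch: str) -> str:
    """Model -march=: replace only the architecture part of the default triple."""
    _old_arch, _, rest = default_triple.partition("-")
    return f"{arch}-{rest}" if rest else arch


def apply_mtriple(default_triple: str, triple: str) -> str:
    """Model -mtriple=: the requested triple replaces the default entirely."""
    return triple


# Hypothetical default triple of a macOS build host.
HOST_DEFAULT = "x86_64-apple-darwin"

# With -march= the host vendor and OS leak into the resulting triple.
print(apply_march(HOST_DEFAULT, "amdgcn"))
# With -mtriple= every component is stated explicitly.
print(apply_mtriple(HOST_DEFAULT, "amdgcn-amd-amdhsa"))
```

In this model the first call yields a triple that still carries the host's `apple-darwin` components, which is the kind of nonsensical combination the commit message warns about.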


Revision tags: llvmorg-17.0.6, llvmorg-17.0.5
# a4196666 13-Nov-2023 Jay Foad <jay.foad@amd.com>

[AMDGPU] Revert "Preliminary patch for divergence driven instruction selection. Operands Folding 1." (#71710)

This reverts commit 201f892b3b597f24287ab6a712a286e25a45a7d9.


Revision tags: llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6
# a70d5e25 07-Jun-2023 Amaury Séchet <deadalnix@gmail.com>

[DAGCombine] Make sure combined nodes are added back to the worklist in topological order.

Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D127115
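
To make the ordering concrete: in a DAG whose edges run from a node to its users, a topological order lists the node before any of its users, and the reverse order lists the users first. The sketch below uses Python's standard `graphlib` on a made-up three-node graph. The node names are hypothetical, and this is not the DAGCombiner's worklist code.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each key maps to the nodes it depends on (its operands).
# "add" stands for the node that was just combined; "shl" and "store" use it.
operands = {
    "add": set(),
    "shl": {"add"},
    "store": {"add", "shl"},
}

# static_order() yields every node after all of its operands.
topological = list(TopologicalSorter(operands).static_order())
reverse_topological = list(reversed(topological))

print("topological:        ", topological)
print("reverse topological:", reverse_topological)
```

The graph is chosen so that only one topological order exists, which keeps the comparison between the two orders unambiguous.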


# c9998ec1 05-Jun-2023 JP Lehr <JanPatrick.Lehr@amd.com>

Revert "[DAGCombine] Make sure combined nodes are added back to the worklist in topological order."

This reverts commit e69fa03ddd85812be3143d79a0359c3e8d43bd45.

This patch led to build timeouts on the AMDGPU OpenMP runtime
buildbot.


Revision tags: llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4
# e69fa03d 30-Apr-2022 Amaury Séchet <deadalnix@gmail.com>

[DAGCombine] Make sure combined nodes are added back to the worklist in topological order.

Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D127115


# d85e849f 02-Dec-2022 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Convert some assorted tests to opaque pointers


# 69d5a038 28-Jul-2022 Simon Pilgrim <llvm-dev@redking.me.uk>

[DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits

This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.

This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits.

There are a few cases where we end up with extra register moves, which I think we can accept in exchange for the increased ILP.

Differential Revision: https://reviews.llvm.org/D77804


Revision tags: llvmorg-14.0.3, llvmorg-14.0.2
# ac94073d 12-Apr-2022 Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>

[AMDGPU] Refine 64 bit misaligned LDS ops selection

Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64 aligned by 8: 3.2 sec
ds_write2_b32 aligned by 8: 3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8: 13.2 sec
ds_write_b64 aligned by 1: 7.3 sec
ds_write2_b32 aligned by 1: 7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1: 13.2 sec
ds_write_b64 aligned by 2: 7.3 sec
ds_write2_b32 aligned by 2: 7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2: 13.3 sec
ds_write_b64 aligned by 4: 4.6 sec
ds_write2_b32 aligned by 4: 3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4: 13.3 sec
ds_read_b64 aligned by 8: 2.3 sec
ds_read2_b32 aligned by 8: 2.2 sec
ds_read_u16 * 4 aligned by 8: 4.8 sec
ds_read_u8 * 8 aligned by 8: 8.6 sec
ds_read_b64 aligned by 1: 4.4 sec
ds_read2_b32 aligned by 1: 7.3 sec
ds_read_u16 * 4 aligned by 1: 14.0 sec
ds_read_u8 * 8 aligned by 1: 8.7 sec
ds_read_b64 aligned by 2: 4.4 sec
ds_read2_b32 aligned by 2: 7.3 sec
ds_read_u16 * 4 aligned by 2: 4.8 sec
ds_read_u8 * 8 aligned by 2: 8.7 sec
ds_read_b64 aligned by 4: 4.4 sec
ds_read2_b32 aligned by 4: 2.3 sec
ds_read_u16 * 4 aligned by 4: 4.8 sec
ds_read_u8 * 8 aligned by 4: 8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64 aligned by 8: 4.4 sec
ds_write2_b32 aligned by 8: 4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8: 13.0 sec
ds_write_b64 aligned by 1: 23.2 sec
ds_write2_b32 aligned by 1: 23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1: 13.0 sec
ds_write_b64 aligned by 2: 23.2 sec
ds_write2_b32 aligned by 2: 23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2: 13.1 sec
ds_write_b64 aligned by 4: 13.5 sec
ds_write2_b32 aligned by 4: 4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4: 13.1 sec
ds_read_b64 aligned by 8: 3.5 sec
ds_read2_b32 aligned by 8: 3.4 sec
ds_read_u16 * 4 aligned by 8: 5.3 sec
ds_read_u8 * 8 aligned by 8: 8.5 sec
ds_read_b64 aligned by 1: 13.1 sec
ds_read2_b32 aligned by 1: 22.7 sec
ds_read_u16 * 4 aligned by 1: 43.9 sec
ds_read_u8 * 8 aligned by 1: 7.9 sec
ds_read_b64 aligned by 2: 13.1 sec
ds_read2_b32 aligned by 2: 22.7 sec
ds_read_u16 * 4 aligned by 2: 5.6 sec
ds_read_u8 * 8 aligned by 2: 7.9 sec
ds_read_b64 aligned by 4: 13.1 sec
ds_read2_b32 aligned by 4: 3.4 sec
ds_read_u16 * 4 aligned by 4: 5.6 sec
ds_read_u8 * 8 aligned by 4: 7.9 sec
```

GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, whereas on GFX10 even a full split
is faster. However, this gain is only theoretical, because splitting
an access down to the sub-DWORD level requires more registers and
packing/unpacking logic. Ignoring that option, it is better to use a
single 64-bit instruction on misaligned data, with the exception of
4-byte aligned data, where ds_read2_b32/ds_write2_b32 is better.

Differential Revision: https://reviews.llvm.org/D123956
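
The selection rule stated in the last paragraph can be checked against the measurements. The sketch below is a minimal Python illustration, not the backend's lowering code: `pick_64bit_ds_write` is a hypothetical helper, `align` is assumed to mean the known alignment in bytes, and the timings are copied from the gfx900 `ds_write_b64` and `ds_write2_b32` rows above.

```python
# Timings (seconds) copied from the gfx900 ds_write rows of the data above,
# keyed by known alignment in bytes.
GFX900_WRITE_SEC = {
    8: {"ds_write_b64": 3.2, "ds_write2_b32": 3.2},
    4: {"ds_write_b64": 4.6, "ds_write2_b32": 3.2},
    2: {"ds_write_b64": 7.3, "ds_write2_b32": 7.5},
    1: {"ds_write_b64": 7.3, "ds_write2_b32": 7.5},
}


def pick_64bit_ds_write(align: int) -> str:
    """Rule from the commit text: one 64-bit op, except for 4-byte alignment."""
    return "ds_write2_b32" if align == 4 else "ds_write_b64"


for align, timings in GFX900_WRITE_SEC.items():
    choice = pick_64bit_ds_write(align)
    # The chosen instruction is never slower than the alternative in this table.
    assert timings[choice] == min(timings.values())
    print(f"align {align}: {choice} ({timings[choice]} sec)")
```

The fully split b16 and b8 variants are left out on purpose, since the commit message sets them aside because of their register and packing cost.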


Revision tags: llvmorg-14.0.1
# f6462a26 11-Apr-2022 Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>

[AMDGPU] Split unaligned 4 DWORD DS operations

Similarly to 3 DWORD operations, it is better for performance
to split unaligned operations as long as these are at least
DWORD aligned. Performance data:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b128 aligned by 16: 4.9 sec
ds_write2_b64 aligned by 16: 5.1 sec
ds_write2_b32 * 2 aligned by 16: 5.5 sec
ds_write_b128 aligned by 1: 8.1 sec
ds_write2_b64 aligned by 1: 8.7 sec
ds_write2_b32 * 2 aligned by 1: 14.0 sec
ds_write_b128 aligned by 2: 8.1 sec
ds_write2_b64 aligned by 2: 8.7 sec
ds_write2_b32 * 2 aligned by 2: 14.0 sec
ds_write_b128 aligned by 4: 5.6 sec
ds_write2_b64 aligned by 4: 8.7 sec
ds_write2_b32 * 2 aligned by 4: 5.6 sec
ds_write_b128 aligned by 8: 5.6 sec
ds_write2_b64 aligned by 8: 5.1 sec
ds_write2_b32 * 2 aligned by 8: 5.6 sec
ds_read_b128 aligned by 16: 3.8 sec
ds_read2_b64 aligned by 16: 3.8 sec
ds_read2_b32 * 2 aligned by 16: 4.0 sec
ds_read_b128 aligned by 1: 4.6 sec
ds_read2_b64 aligned by 1: 8.1 sec
ds_read2_b32 * 2 aligned by 1: 14.0 sec
ds_read_b128 aligned by 2: 4.6 sec
ds_read2_b64 aligned by 2: 8.1 sec
ds_read2_b32 * 2 aligned by 2: 14.0 sec
ds_read_b128 aligned by 4: 4.6 sec
ds_read2_b64 aligned by 4: 8.1 sec
ds_read2_b32 * 2 aligned by 4: 4.0 sec
ds_read_b128 aligned by 8: 4.6 sec
ds_read2_b64 aligned by 8: 3.8 sec
ds_read2_b32 * 2 aligned by 8: 4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b128 aligned by 16: 6.2 sec
ds_write2_b64 aligned by 16: 7.1 sec
ds_write2_b32 * 2 aligned by 16: 7.6 sec
ds_write_b128 aligned by 1: 24.1 sec
ds_write2_b64 aligned by 1: 25.2 sec
ds_write2_b32 * 2 aligned by 1: 43.7 sec
ds_write_b128 aligned by 2: 24.1 sec
ds_write2_b64 aligned by 2: 25.1 sec
ds_write2_b32 * 2 aligned by 2: 43.7 sec
ds_write_b128 aligned by 4: 14.4 sec
ds_write2_b64 aligned by 4: 25.1 sec
ds_write2_b32 * 2 aligned by 4: 7.6 sec
ds_write_b128 aligned by 8: 14.4 sec
ds_write2_b64 aligned by 8: 7.1 sec
ds_write2_b32 * 2 aligned by 8: 7.6 sec
ds_read_b128 aligned by 16: 6.2 sec
ds_read2_b64 aligned by 16: 6.3 sec
ds_read2_b32 * 2 aligned by 16: 7.5 sec
ds_read_b128 aligned by 1: 12.5 sec
ds_read2_b64 aligned by 1: 24.0 sec
ds_read2_b32 * 2 aligned by 1: 43.6 sec
ds_read_b128 aligned by 2: 12.5 sec
ds_read2_b64 aligned by 2: 24.0 sec
ds_read2_b32 * 2 aligned by 2: 43.6 sec
ds_read_b128 aligned by 4: 12.5 sec
ds_read2_b64 aligned by 4: 24.0 sec
ds_read2_b32 * 2 aligned by 4: 7.5 sec
ds_read_b128 aligned by 8: 12.5 sec
ds_read2_b64 aligned by 8: 6.3 sec
ds_read2_b32 * 2 aligned by 8: 7.5 sec
```

Differential Revision: https://reviews.llvm.org/D123634
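
One way to read the table is to pick the fastest variant per alignment. The sketch below does that in Python for the gfx1030 `ds_write` rows, with the numbers copied from the data above. The dictionary layout and the `fastest` helper are assumptions for illustration, not part of the patch.

```python
# Timings (seconds) copied from the gfx1030 ds_write rows above,
# keyed by alignment in bytes.
GFX1030_WRITE_SEC = {
    16: {"ds_write_b128": 6.2, "ds_write2_b64": 7.1, "ds_write2_b32 * 2": 7.6},
    8: {"ds_write_b128": 14.4, "ds_write2_b64": 7.1, "ds_write2_b32 * 2": 7.6},
    4: {"ds_write_b128": 14.4, "ds_write2_b64": 25.1, "ds_write2_b32 * 2": 7.6},
    2: {"ds_write_b128": 24.1, "ds_write2_b64": 25.1, "ds_write2_b32 * 2": 43.7},
    1: {"ds_write_b128": 24.1, "ds_write2_b64": 25.2, "ds_write2_b32 * 2": 43.7},
}


def fastest(timings: dict) -> str:
    """Return the instruction sequence with the lowest measured time."""
    return min(timings, key=timings.get)


best_by_align = {align: fastest(t) for align, t in GFX1030_WRITE_SEC.items()}

for align, name in best_by_align.items():
    print(f"aligned by {align}: {name}")
```

On these numbers the split forms win only when the data is at least DWORD aligned, while below that the single b128 operation remains the fastest of the three.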


# 65b8a432 12-Apr-2022 Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>

[AMDGPU] Update ds-alignment.ll test checks. NFC.


# 3870b360 11-Apr-2022 Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>

[AMDGPU] Split unaligned 3 DWORD DS operations

I have written a minitest to check the performance. Overall,
the benefit of aligned b96 operations on data which is not
known to be aligned but happens to be is small, while the performance
hit of using b96 operations on really unaligned memory is
high.

The only exception is when data is not aligned even by 4; it
is better to use b96 in this case.

Here is the test output on Vega and Navi:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b96 aligned: 3.4 sec
ds_write_b32 + ds_write_b64 aligned: 4.5 sec
ds_write_b32 * 3 aligned: 4.8 sec
ds_write_b96 misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 1: 7.2 sec
ds_write_b32 * 3 misaligned by 1: 10.0 sec
ds_write_b96 misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 2: 7.2 sec
ds_write_b32 * 3 misaligned by 2: 10.1 sec
ds_write_b96 misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 4: 4.2 sec
ds_write_b32 * 3 misaligned by 4: 4.9 sec
ds_write_b96 misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64 misaligned by 8: 4.6 sec
ds_write_b32 * 3 misaligned by 8: 4.9 sec
ds_read_b96 aligned: 3.3 sec
ds_read_b32 + ds_read_b64 aligned: 4.9 sec
ds_read_b32 * 3 aligned: 2.6 sec
ds_read_b96 misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 1: 7.2 sec
ds_read_b32 * 3 misaligned by 1: 10.1 sec
ds_read_b96 misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 2: 7.2 sec
ds_read_b32 * 3 misaligned by 2: 10.1 sec
ds_read_b96 misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 4: 2.6 sec
ds_read_b32 * 3 misaligned by 4: 2.6 sec
ds_read_b96 misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64 misaligned by 8: 4.9 sec
ds_read_b32 * 3 misaligned by 8: 2.6 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b96 aligned: 4.1 sec
ds_write_b32 + ds_write_b64 aligned: 13.0 sec
ds_write_b32 * 3 aligned: 4.5 sec
ds_write_b96 misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64 misaligned by 1: 22.0 sec
ds_write_b32 * 3 misaligned by 1: 31.5 sec
ds_write_b96 misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 2: 22.0 sec
ds_write_b32 * 3 misaligned by 2: 31.5 sec
ds_write_b96 misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 4: 4.0 sec
ds_write_b32 * 3 misaligned by 4: 4.5 sec
ds_write_b96 misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64 misaligned by 8: 13.0 sec
ds_write_b32 * 3 misaligned by 8: 4.5 sec
ds_read_b96 aligned: 3.8 sec
ds_read_b32 + ds_read_b64 aligned: 12.8 sec
ds_read_b32 * 3 aligned: 4.4 sec
ds_read_b96 misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 1: 21.8 sec
ds_read_b32 * 3 misaligned by 1: 31.5 sec
ds_read_b96 misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 2: 21.9 sec
ds_read_b32 * 3 misaligned by 2: 31.5 sec
ds_read_b96 misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 4: 3.8 sec
ds_read_b32 * 3 misaligned by 4: 4.5 sec
ds_read_b96 misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64 misaligned by 8: 12.8 sec
ds_read_b32 * 3 misaligned by 8: 4.5 sec
```

Fixes: SWDEV-330802

Differential Revision: https://reviews.llvm.org/D123524
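
The "small benefit, high penalty" trade-off can be put in numbers. The following Python sketch uses three gfx1030 `ds_write` rows copied from the output above. The variable names are hypothetical and the arithmetic is only an illustration of the argument, not part of the patch.

```python
# Timings (seconds) copied from the gfx1030 ds_write rows above.
ALIGNED = {"ds_write_b96": 4.1, "ds_write_b32 * 3": 4.5}
MISALIGNED_BY_4 = {"ds_write_b96": 12.4, "ds_write_b32 * 3": 4.5}
MISALIGNED_BY_1 = {"ds_write_b96": 12.5, "ds_write_b32 * 3": 31.5}

# What b96 gains over the split form when the data happens to be aligned.
gain_when_aligned = ALIGNED["ds_write_b32 * 3"] - ALIGNED["ds_write_b96"]

# What b96 loses to the split form when the data is only DWORD aligned.
loss_when_dword_aligned = (
    MISALIGNED_BY_4["ds_write_b96"] - MISALIGNED_BY_4["ds_write_b32 * 3"]
)

# Below DWORD alignment the split form is slower, so b96 stays preferable.
b96_wins_below_dword = (
    MISALIGNED_BY_1["ds_write_b96"] < MISALIGNED_BY_1["ds_write_b32 * 3"]
)

print(f"gain when aligned:       {gain_when_aligned:.1f} sec")
print(f"loss when DWORD aligned: {loss_when_dword_aligned:.1f} sec")
print(f"b96 wins below DWORD:    {b96_wins_below_dword}")
```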


# e66f0edb 07-Apr-2022 Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>

[AMDGPU] Split unaligned LDS access instead of scalarizing

There is no need to fully scalarize an unaligned operation in
some cases; just split it to alignment.

Differential Revision: https://reviews.llvm.org/D123330


Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3
# d8b69040 20-Jan-2022 Abinav Puthan Purayil <abinav.puthanpurayil@amd.com>

[AMDGPU] Set MemoryVT for truncstores in tblgen.

GlobalISelEmitter was skipping these patterns when its predicates were
checked. This patch should allow us to select d16_hi stores in
GlobalISel.

Differential Revision: https://reviews.llvm.org/D117762


Revision tags: llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init
# c2229724 26-Jul-2021 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting

These actions should only be used for adjusting the register types
(and the memory type as needed to satisfy the register
type). Unaligned accesses should be split as a type of lowering.

This has the effect of improving the code in many cases since now we
produce zextloads instead of separate loads with ands. The load/store
legality rules still seem far more complicated than necessary though.


# da067ed5 10-Nov-2021 Austin Kerbow <Austin.Kerbow@amd.com>

[AMDGPU] Set most sched model resource's BufferSize to one

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retrieved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777


# d2e66d7f 06-Sep-2021 Konstantin Schwarz <konstantin.schwarz@hightec-rt.com>

[GlobalISel] Add a combine for and(load , mask) -> zextload

This only handles simple masks, not shifted masks, for now.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D109357
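
The equivalence behind this combine can be demonstrated on plain bytes. The sketch below is a Python model under stated assumptions: little-endian memory, masks that cover whole bytes, and hypothetical helper names. It does not reproduce the GlobalISel combiner code or its exact legality checks.

```python
def is_simple_mask(mask: int) -> bool:
    """True for masks of the form 0b0...01...1 (contiguous ones from bit 0)."""
    return mask > 0 and (mask & (mask + 1)) == 0


def and_of_load(memory: bytes, load_bytes: int, mask: int) -> int:
    """Wide load followed by an AND with the mask."""
    return int.from_bytes(memory[:load_bytes], "little") & mask


def zextload(memory: bytes, narrow_bytes: int) -> int:
    """Narrow load whose result is zero-extended to the wider type."""
    return int.from_bytes(memory[:narrow_bytes], "little")


memory = bytes([0x78, 0x56, 0x34, 0x12])

# A simple mask keeps only the low bytes, which a narrower load reads directly.
assert and_of_load(memory, 4, 0xFF) == zextload(memory, 1)
assert and_of_load(memory, 4, 0xFFFF) == zextload(memory, 2)

# A shifted mask such as 0xFF00 is not of the simple form modelled here.
assert not is_simple_mask(0xFF00)
```

A shifted mask would additionally need an offset load and possibly a shift, which is why it falls outside what this simple model covers.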


# 3ce1b963 08-Sep-2021 Joe Nash <Joseph.Nash@amd.com>

[AMDGPU] Switch PostRA sched to MachineSched

Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde


Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2
# 31a9659d 07-Jun-2021 Matt Arsenault <Matthew.Arsenault@amd.com>

GlobalISel: Avoid use of G_INSERT in insertParts

G_INSERT legalization is incomplete and doesn't work very
well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES
padding with undef values (although this can get pretty large).

For the case of load/store narrowing, this is still performing the
load/stores in irregularly sized pieces. It might be cleaner to split
this down into equal sized pieces, and rely on load/store merging to
optimize it.


Revision tags: llvmorg-12.0.1-rc1
# ac64995c 08-Apr-2021 hsmahesha <mahesha.comp@gmail.com>

[AMDGPU] Only use ds_read/write_b128 for alignment >= 16

PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100008

