Revision tags: llvmorg-21-init

# 6206f544 | 23-Jan-2025 | Lucas Ramirez <11032120+lucas-rami@users.noreply.github.com>
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies.
`getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither bound of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes`, since it is the range of workgroup sizes that produces the range of achievable occupancies.
Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies; sometimes workgroup sizes strictly inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; extensive documentation explains the rationale behind the sometimes relatively obscure calculations.
As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling the remaining 40-36=4 wave slots as much as possible without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU.
Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10].
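The arithmetic above generalizes; the following minimal standalone C++ sketch (not the actual LLVM implementation, with simplified rounding assumptions) reproduces the example's bounds:

// Hedged sketch: achievable occupancy per workgroup size for the example
// target (10 waves/EU, 4 EUs/CU, 64-wide waves, no LDS pressure). Rounding
// choices here are simplifying assumptions, not LLVM's exact rules.
#include <algorithm>
#include <cstdio>

constexpr int WaveSize = 64, EUsPerCU = 4, MaxWavesPerEU = 10;

int occupancy(int GroupSize) {
  int GroupWaves = (GroupSize + WaveSize - 1) / WaveSize; // waves per group
  int WaveSlots = MaxWavesPerEU * EUsPerCU;               // 40 slots per CU
  int GroupsPerCU = WaveSlots / GroupWaves;               // whole groups only
  return GroupsPerCU * GroupWaves / EUsPerCU;             // waves per EU
}

int main() {
  // Scan the whole flat workgroup size range [513, 1024]: the bounds come
  // from sizes strictly inside the range (640 and 896), not its endpoints.
  int MinOcc = occupancy(513), MaxOcc = MinOcc;
  for (int Size = 514; Size <= 1024; ++Size) {
    MinOcc = std::min(MinOcc, occupancy(Size));
    MaxOcc = std::max(MaxOcc, occupancy(Size));
  }
  std::printf("endpoints: %d and %d; range: [%d,%d]\n", occupancy(513),
              occupancy(1024), MinOcc, MaxOcc); // endpoints: 9 and 8; [7,10]
}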
Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary.
Fixes #118220.

# 21704a68 | 17-Jan-2025 | Stanislav Mekhanoshin <rampitec@users.noreply.github.com>
[AMDGPU] Fix printing hasInitWholeWave in mir (#123232)

Revision tags: llvmorg-19.1.7

# 2e5c2982 | 10-Jan-2025 | Austin Kerbow <Austin.Kerbow@amd.com>
[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)
Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If the hardware has compatible firmware, the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature.
A pass is added which is intended to run at the very end of the pipeline, to avoid any optimizations that would assume the prologue is a real predecessor block of the actual code start. In reality we have two possible entry points for the function (a toy sketch follows the list):
1. The optimized path that supports kernarg preloading, which begins at an offset of 256 bytes.
2. The backwards-compatible entry point, which starts at offset 0.
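As a rough mental model only, here is a trivial, purely illustrative C++ sketch of the dispatch behavior described above; the real mechanism is firmware behavior, not code, and the constant below simply mirrors the 256-byte figure from the commit text:

#include <cstdio>

constexpr unsigned PrologueSize = 256; // bytes reserved for the compat path

// Where execution begins depends on whether firmware supports preloading.
unsigned entryOffset(bool FirmwareSupportsPreload) {
  return FirmwareSupportsPreload ? PrologueSize : 0;
}

int main() {
  std::printf("preload-aware firmware enters at byte %u\n", entryOffset(true));
  std::printf("legacy firmware enters at byte %u\n", entryOffset(false));
}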

# 67c55b1f | 18-Dec-2024 | Ruiling, Song <ruiling.song@amd.com>
[AMDGPU] Make max dwords of memory cluster configurable (#119342)
We find it helpful to increase the value for graphics workloads. Make it configurable so we can experiment with different values.
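For reference, a knob like this is typically surfaced in the AMDGPU backend via cl::opt; the option name and default below are assumptions for illustration, not copied from the patch:

#include "llvm/Support/CommandLine.h"

// Assumed flag name and default, shown only to illustrate the mechanism.
static llvm::cl::opt<unsigned> MaxMemoryClusterDWords(
    "amdgpu-max-memory-cluster-dwords",
    llvm::cl::desc("Maximum number of dwords of memory operations to cluster"),
    llvm::cl::init(8));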

Revision tags: llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4

# bc7e099a | 07-Nov-2024 | dyung <douglas.yung@sony.com>
Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)
Reverts llvm/llvm-project#115291
Reverting due to test failures on many bots including
https://lab.llvm.org/buildbot/#/builde
Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)
Reverts llvm/llvm-project#115291
Reverting due to test failures on many bots including
https://lab.llvm.org/buildbot/#/builders/174/builds/8049

# 21835ee2 | 07-Nov-2024 | Akshat Oke <Akshat.Oke@amd.com>
[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291)

# 3495d045 | 05-Nov-2024 | Akshat Oke <Akshat.Oke@amd.com>
[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129)

Revision tags: llvmorg-19.1.3, llvmorg-19.1.2

# 8d13e7b8 | 03-Oct-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)

Revision tags: llvmorg-19.1.1

# ac0f64f0 | 30-Sep-2024 | Christudasan Devadasan <christudasan.devadasan@amd.com>
[AMDGPU] Split vgpr regalloc pipeline (#93526)
Allocating wwm-registers and per-thread VGPR operands together imposes many challenges in the way the registers are reused during allocation. There are times when regalloc reuses the registers of regular VGPR operations for wwm-operations in a small range, unintentionally clobbering their inactive lanes and causing correctness issues that are hard to trace.
This patch splits the VGPR allocation pipeline further to allocate wwm-registers first and the regular VGPR operands in a separate pipeline. The split ensures that the physical registers used for wwm allocations won't take part in the subsequent allocation pipeline, avoiding any such clobbering.

# 23487be4 | 26-Sep-2024 | Christudasan Devadasan <christudasan.devadasan@amd.com>
[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911)
Multiple conditions exist to decide whether callee save spills/restores are required for the amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. This patch consolidates them all in a single place.

# e03f4271 | 19-Sep-2024 | Jay Foad <jay.foad@amd.com>
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
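A minimal before/after illustration of the change, using a hypothetical callee:

#include "llvm/ADT/ArrayRef.h"
#include <optional>

// Hypothetical function taking an ArrayRef parameter.
void takesArrayRef(llvm::ArrayRef<int> Vals) { (void)Vals; }

void caller() {
  takesArrayRef(std::nullopt); // old style: relies on ArrayRef(std::nullopt_t)
  takesArrayRef({});           // new style: empty braced initializer
}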

Revision tags: llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3

# a5666359 | 19-Aug-2024 | Christudasan Devadasan <christudasan.devadasan@amd.com>
[AMDGPU] Move AMDGPUCodeGenPassBuilder into AMDGPUTargetMachine (NFC) (#103720)
This will allow us to reuse the existing flags and the static functions while building the pipeline for the new pass manager.

Revision tags: llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init

# 63fae3ed | 17-Jul-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] clang-tidy: no else after return etc. NFC. (#99298)

# c7309dad | 17-Jul-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Use range-based for loops. NFC. (#99047)

# 5e338f1f | 17-Jul-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC.

# 0b43d573 | 16-Jul-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] clang-tidy: replace macro with enum. NFC.

# 7e9b49f6 | 25-Jun-2024 | Nicolai Hähnle <nicolai.haehnle@amd.com>
AMDGPU: Add plumbing for private segment size argument (#96445)
The actual size of scratch/private is determined at dispatch time, so add more plumbing to request it. Will be used in a subsequent change.

# d6c74102 | 25-Jun-2024 | Nicolai Hähnle <nicolai.haehnle@amd.com>
AMDGPU: Remove an outdated TODO (#96446)
We have a fixed calling convention for the stack pointer and frame pointer; we shouldn't try to shift anything around.

Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5

# 5b187751 | 29-Apr-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Fix typo in #89773
Fixes #90281

# 46163688 | 24-Apr-2024 | Jay Foad <jay.foad@amd.com>
[AMDGPU] Allow WorkgroupID intrinsics in amdgpu_gfx functions (#89773)
With GFX12 architected SGPRs, the workgroup IDs are trivially available in any function called from a compute entrypoint.
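As a hedged example of what this enables, a clang builtin that lowers to the workgroup-ID intrinsic can now appear in an amdgpu_gfx function; the builtin spelling follows clang's AMDGPU support, and the surrounding calling-convention glue is assumed context:

// Callable from a compute entrypoint; with GFX12 architected SGPRs the ID is
// read directly from an SGPR rather than requiring explicit ABI plumbing.
extern "C" unsigned currentWorkgroupIdX() {
  return __builtin_amdgcn_workgroup_id_x(); // llvm.amdgcn.workgroup.id.x
}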

Revision tags: llvmorg-18.1.4, llvmorg-18.1.3

# b6b703b2 | 21-Mar-2024 | Matt Arsenault <Matthew.Arsenault@amd.com>
AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948)
SIMachineFunctionInfo has a scan of the function body for inline asm which may use AGPRs, or callees in SIMachineFunctionInfo. Move this into the attributor, so it actually works interprocedurally.
Most of the test churn could probably be avoided if this avoided adding the attribute on subtargets without AGPRs. We should also try to delete the MIR scan in usesAGPRs, but it seems trickier to eliminate.

Revision tags: llvmorg-18.1.2

# c4e517f5 | 12-Mar-2024 | Jun Wang <jwang86@yahoo.com>
[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This attribute, which consists of three integers, allows programmers to let the compiler know the number of workgroups to be launched in each of the three dimensions, enabling optimizations based on that information.
Co-authored-by: Jun Wang <jun.wang7@amd.com>
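A hedged sketch of how a programmer might spell this; the exact attribute syntax is assumed from the commit description, and the amdgpu_kernel calling-convention attribute is used only to make the example self-contained:

// Assumed spelling: tells the compiler the kernel is always launched as a
// 32 x 4 x 1 grid of workgroups, so the backend can specialize accordingly.
__attribute__((amdgpu_num_work_groups(32, 4, 1)))
__attribute__((amdgpu_kernel)) void myKernel(int *Out) {
  *Out = 0;
}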

Revision tags: llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3

# bc6955f1 | 09-Feb-2024 | Diana Picus <Diana-Magda.Picus@amd.com>
[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.
[1]
https://github.com/llvm/llvm-project/commit/34c8b835b16fb3879f1b9770e91df21883356bb6

Revision tags: llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init

# 230c13d5 | 24-Jan-2024 | Christudasan Devadasan <christudasan.devadasan@amd.com>
[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669)
CSR SGPR spilling currently uses the earliest available physical VGPRs, which imposes high register pressure while trying to allocate large VGPR tuples within the default register budget.
This patch changes the spilling strategy to pick VGPRs in reverse order, highest available VGPR first, and to shift them back to the lowest available range after regalloc. With that, the initial VGPRs remain available for allocation, improving the chances of finding a large number of contiguous registers.

# 51392996 | 17-Dec-2023 | Carl Ritson <carl.ritson@amd.com>
[AMDGPU] Track physical VGPRs used for SGPR spills (#75573)
Physical VGPRs used for SGPR spills need to be tracked independently of WWM reserved registers. The WWM reserved set contains extra registers allocated during the WWM pre-allocation pass.
This causes SGPR spills allocated after WWM pre-allocation to overlap with WWM register usage, e.g., if the frame pointer is spilled during prologue/epilogue insertion.