History log of /llvm-project/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp (Results 1 – 25 of 214)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: llvmorg-21-init
# 6206f544 23-Jan-2025 Lucas Ramirez <11032120+lucas-rami@users.noreply.github.com>

[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)

Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on

[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)

Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.

show more ...


# 21704a68 17-Jan-2025 Stanislav Mekhanoshin <rampitec@users.noreply.github.com>

[AMDGPU] Fix printing hasInitWholeWave in mir (#123232)


Revision tags: llvmorg-19.1.7
# 2e5c2982 10-Jan-2025 Austin Kerbow <Austin.Kerbow@amd.com>

[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)

Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equi

[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)

Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.

A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.

show more ...


# 67c55b1f 18-Dec-2024 Ruiling, Song <ruiling.song@amd.com>

[AMDGPU] Make max dwords of memory cluster configurable (#119342)

We find it helpful to increase the value for graphics workload. Make it
configurable so we can experiment with a different value.


Revision tags: llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4
# bc7e099a 07-Nov-2024 dyung <douglas.yung@sony.com>

Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)

Reverts llvm/llvm-project#115291

Reverting due to test failures on many bots including
https://lab.llvm.org/buildbot/#/builde

Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)

Reverts llvm/llvm-project#115291

Reverting due to test failures on many bots including
https://lab.llvm.org/buildbot/#/builders/174/builds/8049

show more ...


# 21835ee2 07-Nov-2024 Akshat Oke <Akshat.Oke@amd.com>

[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291)


# 3495d045 05-Nov-2024 Akshat Oke <Akshat.Oke@amd.com>

[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129)


Revision tags: llvmorg-19.1.3, llvmorg-19.1.2
# 8d13e7b8 03-Oct-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] Qualify auto. NFC. (#110878)

Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)


Revision tags: llvmorg-19.1.1
# ac0f64f0 30-Sep-2024 Christudasan Devadasan <christudasan.devadasan@amd.com>

[AMDGPU] Split vgpr regalloc pipeline (#93526)

Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There a

[AMDGPU] Split vgpr regalloc pipeline (#93526)

Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There are
times when regalloc reuses the registers of regular
VGPRs operations for wwm-operations in a small range
leading to unwantedly clobbering their inactive lanes
causing correctness issues that are hard to trace.

This patch splits the VGPR allocation pipeline further
to allocate wwm-registers first and the regular VGPR
operands in a separate pipeline. The splitting would
ensure that the physical registers used for wwm
allocations won't take part in the next allocation
pipeline to avoid any such clobbering.

show more ...


# 23487be4 26-Sep-2024 Christudasan Devadasan <christudasan.devadasan@amd.com>

[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911)

Multiple conditions exist to decide whether callee save spills/restores
are required for amdgpu_cs

[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911)

Multiple conditions exist to decide whether callee save spills/restores
are required for amdgpu_cs_chain or amdgpu_cs_chain_preserve calling
conventions. This patch consolidates them all and moves to a single
place.

show more ...


# e03f4271 19-Sep-2024 Jay Foad <jay.foad@amd.com>

[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)

It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all oc

[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)

It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.

show more ...


Revision tags: llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3
# a5666359 19-Aug-2024 Christudasan Devadasan <christudasan.devadasan@amd.com>

[AMDGPU] Move AMDGPUCodeGenPassBuilder into AMDGPUTargetMachine(NFC) (#103720)

This will allow us to reuse the existing flags and the static
functions while building the pipeline for new pass manage

[AMDGPU] Move AMDGPUCodeGenPassBuilder into AMDGPUTargetMachine(NFC) (#103720)

This will allow us to reuse the existing flags and the static
functions while building the pipeline for new pass manager.

show more ...


Revision tags: llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init
# 63fae3ed 17-Jul-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] clang-tidy: no else after return etc. NFC. (#99298)


# c7309dad 17-Jul-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] Use range-based for loops. NFC. (#99047)


# 5e338f1f 17-Jul-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC.


# 0b43d573 16-Jul-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] clang-tidy: replace macro with enum. NFC.


# 7e9b49f6 25-Jun-2024 Nicolai Hähnle <nicolai.haehnle@amd.com>

AMDGPU: Add plumbing for private segment size argument (#96445)

The actual size of scratch/private is determined at dispatch time, so
add more plumbing to request it. Will be used in subsequent cha

AMDGPU: Add plumbing for private segment size argument (#96445)

The actual size of scratch/private is determined at dispatch time, so
add more plumbing to request it. Will be used in subsequent change.

show more ...


# d6c74102 25-Jun-2024 Nicolai Hähnle <nicolai.haehnle@amd.com>

AMDGPU: Remove an outdated TODO (#96446)

We have a fixed calling convention for stack pointer and frame pointer,
we shouldn't try to shift anything around.


Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5
# 5b187751 29-Apr-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] Fix typo in #89773

Fixes #90281


# 46163688 24-Apr-2024 Jay Foad <jay.foad@amd.com>

[AMDGPU] Allow WorkgroupID intrinsics in amdgpu_gfx functions (#89773)

With GFX12 architected SGPRs the workgroup ids are trivially available
in any function called from a compute entrypoint.


Revision tags: llvmorg-18.1.4, llvmorg-18.1.3
# b6b703b2 21-Mar-2024 Matt Arsenault <Matthew.Arsenault@amd.com>

AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948)

SIMachineFunctionInfo has a scan of the function body for inline asm
which may use AGPRs, or callees in SIMachineFunctionInfo. Move this
i

AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948)

SIMachineFunctionInfo has a scan of the function body for inline asm
which may use AGPRs, or callees in SIMachineFunctionInfo. Move this
into the attributor, so it actually works interprocedurally.

Could probably avoid most of the test churn if this bothered to avoid
adding this on subtargets without AGPRs. We should also probably
try to delete the MIR scan in usesAGPRs but it seems to be trickier
to eliminate.

show more ...


Revision tags: llvmorg-18.1.2
# c4e517f5 12-Mar-2024 Jun Wang <jwang86@yahoo.com>

[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)

A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows progr

[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)

A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>

show more ...


Revision tags: llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3
# bc6955f1 09-Feb-2024 Diana Picus <Diana-Magda.Picus@amd.com>

[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)

At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
T

[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)

At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.

[1]
https://github.com/llvm/llvm-project/commit/34c8b835b16fb3879f1b9770e91df21883356bb6

show more ...


Revision tags: llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init
# 230c13d5 24-Jan-2024 Christudasan Devadasan <christudasan.devadasan@amd.com>

[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669)

CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to a

[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669)

CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.

This patch changes the spilling strategy by picking the VGPRs in the
reverse order, the highest available VGPR first and later after regalloc
shift them back to the lowest available range. With that, the initial
VGPRs would be available for allocation and possibility
of finding large number of contiguous registers will be more.

show more ...


# 51392996 17-Dec-2023 Carl Ritson <carl.ritson@amd.com>

[AMDGPU] Track physical VGPRs used for SGPR spills (#75573)

Physical VGPRs used for SGPR spills need to be tracked independent of
WWM reserved registers. The WWM reserved set contains extra registe

[AMDGPU] Track physical VGPRs used for SGPR spills (#75573)

Physical VGPRs used for SGPR spills need to be tracked independent of
WWM reserved registers. The WWM reserved set contains extra registers
allocated during WWM pre-allocation pass.

This causes SGPR spills allocated after WWM pre-allocation to overlap
with WWM register usage, e.g. if frame pointer is spilt during
prologue/epilog insertion.

show more ...


123456789