|
Revision tags: llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2 |
|
| #
09b53296 |
| 22-Dec-2021 |
Ron Lieberman <Ron.Lieberman@amd.com> |
Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"
This reverts commit 9075009d1fd5f2bf9aa6c2f362d2993691a316b3.
Failed amdgpu runtime buildbot # 3514
|
| #
9075009d |
| 22-Dec-2021 |
RamNalamothu <VenkataRamanaiah.Nalamothu@amd.com> |
[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as live-ins on function entry to preserve their value across calls, so that they get saved and restored around the calls.
But the DWARF unwind information (CFI) needs to track where the return address resides in a frame, and the above approach makes it difficult to track the return address when the CFI information is emitted during frame lowering, since doing so would require understanding the control flow.
This patch moves the return address ABI registers s[30:31] into the callee saved registers range and stops adding live-ins for the return address registers, so that the CFI machinery knows where the return address resides when the CSR save/restore happens during frame lowering.
Doing the above poses an issue: the return instruction now uses the undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use behind the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to `S_SETPC_B64_return` during `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D114652
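For illustration, a minimal C++ sketch of how such an expansion typically looks in a target's expandPostRAPseudo() hook (assumed shape only, not the actual patch; the real code also needs care with liveness, e.g. an undef flag on the return-address use):

    #include "llvm/CodeGen/MachineInstrBuilder.h"

    using namespace llvm;

    // Sketch: rewrite the operand-less SI_RETURN pseudo into the real
    // return, reintroducing the s[30:31] return-address use only after
    // register allocation and CFI emission are done.
    bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
      if (MI.getOpcode() != AMDGPU::SI_RETURN)
        return false;
      MachineBasicBlock &MBB = *MI.getParent();
      BuildMI(MBB, MI, MI.getDebugLoc(), get(AMDGPU::S_SETPC_B64_return))
          .addReg(AMDGPU::SGPR30_SGPR31); // CSR restore has refilled s[30:31]
      MI.eraseFromParent();
      return true;
    }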
|
|
Revision tags: llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2 |
|
| #
729bf9b2 |
| 14-Aug-2021 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU: Enable fixed function ABI by default
Code using indirect calls is broken without this, and there isn't really much value in supporting the old attempt to vary the argument placement based on uses. This resulted in more argument shuffling code anyway.
Also have the option stop implying all inputs need to be passed. This will now rely on the amdgpu-no-* attributes to avoid passing unnecessary values.
|
| #
18f93512 |
| 19-Nov-2021 |
RamNalamothu <VenkataRamanaiah.Nalamothu@amd.com> |
[AMDGPU] Do not generate ELF symbols for the local branch target labels
The compiler was generating symbols in the final code object for local branch target labels. These symbols bloat the code object, slow down the loader, and are only used to simplify disassembly.
Use '--symbolize-operands' with llvm-objdump to improve readability of the branch target operands in disassembly.
Fixes: SWDEV-312223
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D114273
|
| #
3ce1b963 |
| 08-Sep-2021 |
Joe Nash <Joseph.Nash@amd.com> |
[AMDGPU] Switch PostRA sched to MachineSched
Use the GCNHazardRecognizer in the post-RA scheduler. Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
|
| #
722b8e0e |
| 14-Aug-2021 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU: Invert ABI attribute handling
Previously we assumed all callable functions did not need any implicitly passed inputs, and added attributes to functions to indicate when they were necessary. Requiring attributes for correctness is pretty ugly, and it makes supporting indirect and external calls more complicated.
This inverts the direction of the attributes, so an undecorated function is assumed to need all implicit inputs. This enables AMDGPUAttributor by default to mark when functions are proven to not need a given input. This strips the equivalent functionality from the legacy AMDGPUAnnotateKernelFeatures pass.
However, AMDGPUAnnotateKernelFeatures is not fully removed at this point, although it should be in the future. It is still necessary for the two hacky amdgpu-calls and amdgpu-stack-objects attributes, which would be better served by a trivial analysis on the IR during selection. Additionally, AMDGPUAnnotateKernelFeatures still redundantly handles the uniform-work-group-size attribute, which will be removed in a future commit.
At this point when not using -amdgpu-fixed-function-abi, we are still modifying the ABI based on these newly negated attributes. In the future, this option will be removed and the locations for implicit inputs will always be fixed. We will then use the new attributes to avoid passing the values when unnecessary.
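A hedged illustration of the inverted scheme (the helper name is hypothetical; the attribute string follows the amdgpu-no-* convention described above):

    #include "llvm/IR/Function.h"

    // With inverted attributes, an undecorated function is assumed to
    // need the input; only a proven "amdgpu-no-dispatch-ptr" attribute
    // lets codegen skip passing the dispatch pointer.
    static bool needsDispatchPtr(const llvm::Function &F) {
      return !F.hasFnAttribute("amdgpu-no-dispatch-ptr");
    }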
|
| #
f3645c79 |
| 01-Sep-2021 |
Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com> |
[AMDGPU] Use S_BITCMP1_* to replace AND in optimizeCompareInstr
Differential Revision: https://reviews.llvm.org/D109082
|
| #
bf77b112 |
| 31-Aug-2021 |
Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com> |
[AMDGPU] Introduce optimizeCompareInstr
The following patterns are currently handled:
s_cmp_eq_u32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1
s_cmp_eq_i32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1
s_cmp_eq_u64 (s_and_b64 $src, 1), 1 => s_and_b64 $src, 1
s_cmp_ge_u32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1
s_cmp_ge_i32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1
s_cmp_lg_u32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1
s_cmp_lg_i32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1
s_cmp_lg_u64 (s_and_b64 $src, 1), 0 => s_and_b64 $src, 1
s_cmp_gt_u32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1
s_cmp_gt_i32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1
Differential Revision: https://reviews.llvm.org/D109031
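The observation behind the peephole, as a simplified C++ sketch (assumed shape only; the real hook is SIInstrInfo::optimizeCompareInstr and handles the full opcode list above): s_and_b32/s_and_b64 with mask 1 already writes SCC with the value the compare would compute, so the compare is redundant.

    #include "llvm/CodeGen/MachineRegisterInfo.h"

    using namespace llvm;

    // Sketch: return true if Cmp compares the result of an AND-with-1
    // in a way the AND's SCC def already answers, so Cmp can be erased.
    static bool isRedundantCmpOfAndOne(const MachineInstr &Cmp,
                                       const MachineRegisterInfo &MRI) {
      const MachineInstr *Def =
          MRI.getUniqueVRegDef(Cmp.getOperand(0).getReg());
      if (!Def)
        return false;
      return (Def->getOpcode() == AMDGPU::S_AND_B32 ||
              Def->getOpcode() == AMDGPU::S_AND_B64) &&
             Def->getOperand(2).isImm() && Def->getOperand(2).getImm() == 1;
    }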
|
| #
e3cbf1d4 |
| 01-Sep-2021 |
alex-t <alexander.timofeev@amd.com> |
[AMDGPU] enable scalar compare in truncate selection
Currently, the truncate selection DAG node is expanded as a bitwise AND plus a compare to 1. This change enables a scalar comparison in the pattern if the truncate node is uniform.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108925
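In essence the change gates the scalar form on uniformity, roughly as follows (hedged fragment; isDivergent() is the standard SelectionDAG divergence query, and the surrounding pattern code is omitted):

    #include "llvm/CodeGen/SelectionDAGNodes.h"

    // Sketch: the "and 1; cmp 1" expansion of (trunc ... to i1) may use
    // a scalar s_cmp only when the truncate is uniform across the wave.
    static bool canUseScalarCompare(const llvm::SDNode *Trunc) {
      return !Trunc->isDivergent();
    }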
|
|
Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2 |
|
| #
96e1fcb1 |
| 07-Jun-2021 |
Sebastian Neubauer <sebastian.neubauer@amd.com> |
[AMDGPU] Use s_add_i32 for address additions
This allows converting the add instruction to s_addk_i32 or v_add_nc_u32, instead of needing v_add_co_u32, when converting to a VALU instruction.
Differential Revision: https://reviews.llvm.org/D103322
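The intent, sketched with illustrative enums (these are the ISA mnemonics, not the exact LLVM opcode enum values):

    // s_add_i32 has no carry-chain consumer, so it can map to the
    // carry-less v_add_nc_u32; s_add_u32 may feed s_addc_u32 and must
    // keep its carry semantics via v_add_co_u32.
    enum ScalarAdd { S_ADD_I32, S_ADD_U32 };
    enum VectorAdd { V_ADD_NC_U32, V_ADD_CO_U32 };

    static VectorAdd mapAddToVALU(ScalarAdd Opc) {
      return Opc == S_ADD_I32 ? V_ADD_NC_U32 : V_ADD_CO_U32;
    }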
|
|
Revision tags: llvmorg-12.0.1-rc1, llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4 |
|
| #
5682ae2f |
| 25-Mar-2021 |
madhur13490 <Madhur.Amilkanthwar@amd.com> |
[AMDGPU] Set implicit arg attributes for indirect calls
This patch adds attributes corresponding to implicit arguments to functions/kernels if (1) they have an indirect call, or (2) their address is taken.
Once such attributes are set, the rest of codegen works out of the box for indirect calls. This patch eliminates the potential overhead -fixed-abi imposes even when indirect function calls are not used.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D99347
|
|
Revision tags: llvmorg-12.0.0-rc3, llvmorg-12.0.0-rc2 |
|
| #
3c297a25 |
| 10-Feb-2021 |
madhur13490 <Madhur.Amilkanthwar@amd.com> |
Make fixed-abi default for AMD HSA OS
fixed-abi uses pre-defined and predictable SGPRs/VGPRs for passing arguments. This patch makes this scheme the default when the HSA OS is specified in the triple.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D96340
|
|
Revision tags: llvmorg-11.1.0, llvmorg-11.1.0-rc3, llvmorg-12.0.0-rc1, llvmorg-13-init, llvmorg-11.1.0-rc2, llvmorg-11.1.0-rc1, llvmorg-11.0.1, llvmorg-11.0.1-rc2, llvmorg-11.0.1-rc1 |
|
| #
58de4b20 |
| 29-Oct-2020 |
Jay Foad <jay.foad@amd.com> |
[AMDGPU] Use pseudo instructions for readlane/writelane
This reverts r227987 "R600/SI: Determine target-specific encoding of READLANE and WRITELANE early v2".
All the codegen changes are caused by the post-RA scheduler no longer treating readlane/writelane as scheduling barriers due to having unmodelled side effects. (The pseudos are hasSideEffects = 0, but the real instructions are hasSideEffects = ? which TableGen conservatively treats as 1.)
Differential Revision: https://reviews.llvm.org/D90401
|
|
Revision tags: llvmorg-11.0.0, llvmorg-11.0.0-rc6, llvmorg-11.0.0-rc5, llvmorg-11.0.0-rc4 |
|
| #
a343b9b0 |
| 23-Sep-2020 |
Sebastian Neubauer <sebastian.neubauer@amd.com> |
Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer:
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
|
|
Revision tags: llvmorg-11.0.0-rc3 |
|
| #
ca907bfb |
| 04-Sep-2020 |
Sebastian Neubauer <sebastian.neubauer@amd.com> |
[AMDGPU] Insert waitcnt after returning from call
When memory operations are outstanding on function calls, either the caller or the callee can insert a waitcnt to ensure that all reads are finished. Calls need some time to be executed, so if the callee inserts the waitcnt, filling the instruction buffer and waiting for memory will be interleaved, hiding some latency. This comes at the cost of having a waitcnt inside functions that may not be needed, as no memory operations are outstanding.
For function calls, this is already implemented. The same principle applies to returns: if the caller inserts a waitcnt after the call, the callee does not have to wait, and the return and the memory operation can run in parallel.
This commit implements waiting in the caller after returning from a function call.
Differential Revision: https://reviews.llvm.org/D87674
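A minimal sketch of the caller-side insertion (assumed shape; the real logic lives in the waitcnt-insertion pass and computes precise counts rather than the conservative zero used here):

    #include "llvm/CodeGen/MachineInstrBuilder.h"

    using namespace llvm;

    // Sketch: right after the call returns, wait for all outstanding
    // memory operations in the caller, so the callee need not wait on
    // the caller's behalf.
    static void insertWaitAfterCall(MachineBasicBlock &MBB,
                                    MachineBasicBlock::iterator Call,
                                    const SIInstrInfo &TII) {
      BuildMI(MBB, std::next(Call), Call->getDebugLoc(),
              TII.get(AMDGPU::S_WAITCNT))
          .addImm(0); // 0 encodes vmcnt(0)/expcnt(0)/lgkmcnt(0): wait for all
    }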
|
| #
4bdab2e8 |
| 01-Sep-2020 |
Jay Foad <jay.foad@amd.com> |
[AMDGPU] Fix offset for REL32_HI relocs
The addend in a REL32 reloc needs to be adjusted to account for the offset from the PC value returned by the s_getpc instruction to the point where the reloc is applied. This was being done correctly for (GOTPC)REL32_LO but not for (GOTPC)REL32_HI. This will only make a difference if the target symbol happens to get loaded almost exactly a multiple of 4G away from the relocated instructions.
Differential Revision: https://reviews.llvm.org/D86938
|
|
Revision tags: llvmorg-11.0.0-rc2, llvmorg-11.0.0-rc1, llvmorg-12-init, llvmorg-10.0.1, llvmorg-10.0.1-rc4, llvmorg-10.0.1-rc3, llvmorg-10.0.1-rc2, llvmorg-10.0.1-rc1 |
|
| #
375cec4b |
| 27-Mar-2020 |
Christudasan Devadasan <Christudasan.Devadasan@amd.com> |
[AMDGPU] Introduce more scratch registers in the ABI.
The AMDGPU target has a convention that defines all VGPRs (except the initial 32 argument registers) as callee-saved. This convention is not always efficient: a callee requiring many registers ends up emitting a large number of spills, even though its caller needs only a few.
This patch revises the ABI by introducing more scratch registers that a callee can freely use. The 256 VGPRs are now split into 32 argument registers, 112 scratch registers, and 112 callee saved registers. The scratch registers and the CSRs are intermixed at regular intervals (a split boundary of 8) to obtain better occupancy.
Reviewers: arsenm, t-tye, rampitec, b-sumner, mjbedy, tpr
Reviewed By: arsenm, t-tye
Differential Revision: https://reviews.llvm.org/D76356
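A small standalone program illustrating the stated layout, under the assumption that "a split boundary of 8" means eight-register blocks alternating between scratch and callee-saved across v32..v255:

    #include <cstdio>

    int main() {
      std::printf("v0-v31: argument registers\n");
      int Scratch = 0, Saved = 0;
      for (int Base = 32; Base < 256; Base += 8) {
        bool IsScratch = ((Base - 32) / 8) % 2 == 0; // alternate blocks of 8
        (IsScratch ? Scratch : Saved) += 8;
        std::printf("v%d-v%d: %s\n", Base, Base + 7,
                    IsScratch ? "scratch" : "callee-saved");
      }
      std::printf("total: %d scratch + %d callee-saved\n", Scratch, Saved);
      return 0;
    }

Running it yields 112 scratch and 112 callee-saved registers, matching the split described above.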
|
| #
72e87549 |
| 06-Apr-2020 |
Konstantin Pyzhov <Konstantin.Pyzhov@amd.com> |
[AMDGPU] Disable 'Skip Uniform Regions' optimization by default for AMDGPU.
Reviewers: sameerds, dstuttard
Differential Revision: https://reviews.llvm.org/D77228
|
| #
51dc0283 |
| 06-Apr-2020 |
Konstantin Pyzhov <Konstantin.Pyzhov@amd.com> |
Revert e1730cfeb3588f20dcf4a96b181ad52761666e52
|
| #
e1730cfe |
| 06-Apr-2020 |
Konstantin Pyzhov <Konstantin.Pyzhov@amd.com> |
[AMDGPU] Disable 'Skip Uniform Regions' optimization by default for AMDGPU.
Reviewers: sameerds, dstuttard
Differential Revision: https://reviews.llvm.org/D77228
|
|
Revision tags: llvmorg-10.0.0, llvmorg-10.0.0-rc6, llvmorg-10.0.0-rc5, llvmorg-10.0.0-rc4, llvmorg-10.0.0-rc3 |
|
| #
0e9368cc |
| 04-Mar-2020 |
Scott Linder <Scott.Linder@amd.com> |
[AMDGPU] Move frame pointer from s34 to s33
Remove the gap left between the stack pointer (s32) and frame pointer (s34) now that the scratch wave offset is no longer a part of the calling convention ABI.
Update llvm/docs/AMDGPUUsage.rst to reflect the change.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75657
|
|
Revision tags: llvmorg-10.0.0-rc2, llvmorg-10.0.0-rc1 |
|
| #
60b1967c |
| 21-Jan-2020 |
Scott Linder <Scott.Linder@amd.com> |
[AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in the entry function prologue. This allows us to remove the scratch wave offset register from the calling convention ABI.
As part of this change, allow the use of an inline constant zero for the SOffset of MUBUF instructions accessing the stack in entry functions when a frame pointer is not requested/required. Entry functions with calls still need to set up the calling convention ABI stack pointer register, and reference it in order to address arguments of called functions. The ABI stack pointer register remains unswizzled, but is now wave-relative instead of queue-relative.
Non-entry functions also use an inline constant zero SOffset for wave-relative scratch access, but continue to use the stack and frame pointers as before. When the stack or frame pointer is converted to a swizzled offset it is now scaled directly, as the scratch wave offset no longer needs to be subtracted first.
Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling convention.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75138
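A hedged arithmetic illustration of the swizzling simplification (all values made up; wave64 assumed):

    #include <cstdint>
    #include <cstdio>

    int main() {
      const uint32_t WavefrontSize = 64; // assumed wave64
      // Before: the queue-relative pointer needed the scratch wave
      // offset subtracted before scaling to a swizzled offset.
      uint32_t QueueRelFP = 0x10000, ScratchWaveOffset = 0xF000;
      uint32_t OldSwizzled = (QueueRelFP - ScratchWaveOffset) / WavefrontSize;
      // After: the pointer is already wave-relative and scales directly.
      uint32_t WaveRelFP = 0x1000;
      uint32_t NewSwizzled = WaveRelFP / WavefrontSize;
      std::printf("old %u == new %u\n", OldSwizzled, NewSwizzled);
      return 0;
    }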
|
|
Revision tags: llvmorg-11-init, llvmorg-9.0.1, llvmorg-9.0.1-rc3, llvmorg-9.0.1-rc2, llvmorg-9.0.1-rc1, llvmorg-9.0.0, llvmorg-9.0.0-rc6, llvmorg-9.0.0-rc5, llvmorg-9.0.0-rc4, llvmorg-9.0.0-rc3 |
|
| #
0c096da0 |
| 27-Aug-2019 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU: Fix crash from inconsistent register types for v3i16/v3f16
This is something of a workaround since computeRegisterProperties seems to be doing the wrong thing.
llvm-svn: 370086
|
|
Revision tags: llvmorg-9.0.0-rc2, llvmorg-9.0.0-rc1, llvmorg-10-init, llvmorg-8.0.1, llvmorg-8.0.1-rc4 |
|
| #
b2d24bd5 |
| 09-Jul-2019 |
Christudasan Devadasan <Christudasan.Devadasan@amd.com> |
[AMDGPU] Created a sub-register class for the return address operand in the return instruction.
Function return instruction lowering currently uses the fixed register pair s[30:31] for holding the return address. It can be any SGPR pair other than the CSRs. Created an SGPR-pair sub-register class exclusive of the CSRs, and used this register class while lowering the return instruction.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D63924
llvm-svn: 365512
|
| #
71dfb7ec |
| 08-Jul-2019 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU: Make s34 the FP register
Make the FP register callee saved.
This is tricky because now the FP needs to be spilled in the prolog relative to the incoming SP register, rather than the frame register used throughout the rest of the function. I don't like how this bypasses the standard mechanism for CSR spills just to get the correct insert point. I may look for a better solution, since all CSR VGPRs may also need to have all lanes activated. Another option might be to make getFrameIndexReference change the base register if the frame index is a CSR, and then try to figure out the right insertion point in emitProlog.
If there is a free VGPR lane available for SGPR spilling, try to use it for the FP. If that would require introducing a new VGPR spill, try to use a free call-clobbered SGPR. Only fall back to introducing a new VGPR spill as a last resort.
This also doesn't attempt to handle SGPR spilling with scalar stores.
llvm-svn: 365372
|