|
Revision tags: llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3 |
|
| #
3277c7cd |
| 21-Oct-2024 |
Stanislav Mekhanoshin <rampitec@users.noreply.github.com> |
[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765)
MSG_DEALLOC_VGPRS slows down very small waveslot-limited kernels. It has been identified that this message is only really needed for VGPR-limited kernels. A kernel becomes VGPR limited if the total number of VGPRs per SIMD divided by the number of used VGPRs is less than the number of wave slots.
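A minimal self-contained sketch of the occupancy arithmetic described above; the register and wave-slot counts are illustrative, not taken from any particular subtarget:
```
#include <cstdio>

// Sketch of the "VGPR limited" test: occupancy permitted by the VGPR budget
// is totalVGPRsPerSIMD / usedVGPRs; if that falls below the number of wave
// slots, VGPRs (not wave slots) cap occupancy, and MSG_DEALLOC_VGPRS is
// worthwhile. All numbers here are illustrative.
static bool isVGPRLimited(unsigned totalVGPRsPerSIMD, unsigned usedVGPRs,
                          unsigned waveSlots) {
  unsigned wavesByVGPRBudget = totalVGPRsPerSIMD / usedVGPRs;
  return wavesByVGPRBudget < waveSlots;
}

int main() {
  // A tiny kernel using few VGPRs: wave slots are the limit, so the
  // deallocation message would only add latency.
  printf("small kernel VGPR limited: %d\n", isVGPRLimited(1024, 32, 16));
  // A register-hungry kernel: VGPRs are the limit, keep MSG_DEALLOC_VGPRS.
  printf("large kernel VGPR limited: %d\n", isVGPRLimited(1024, 128, 16));
}
```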
|
|
Revision tags: llvmorg-19.1.2 |
|
| #
c36f9023 |
| 10-Oct-2024 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU/GlobalISel: Insert m0 initialization before sextload/zextload (#111720)
Fixes missing m0 initialization for pre-gfx9 targets with local extending loads.
|
|
Revision tags: llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init |
|
| #
7fa7a08f |
| 19-Jul-2023 |
Jay Foad <jay.foad@amd.com> |
[AMDGPU] Insert s_nop before s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
Differential Revision: https://reviews.llvm.org/D155681
|
| #
f2c164c8 |
| 21-Jun-2023 |
Jay Foad <jay.foad@amd.com> |
[AMDGPU] Do not wait for vscnt on function entry and return
SIInsertWaitcnts inserts waitcnt instructions to resolve data dependencies. The GFX10+ vscnt (VMEM store count) counter is never used in this way. It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer. Hence there is no need to conservatively wait for vscnt to be 0 on function entry and before returns.
Differential Revision: https://reviews.llvm.org/D153537
|
|
Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6 |
|
| #
8e0fadda |
| 28-Nov-2022 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU: Bulk update all GlobalISel tests to use opaque pointers
|
|
Revision tags: llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init |
|
| #
5cae8816 |
| 06-Jul-2022 |
Jay Foad <jay.foad@amd.com> |
[AMDGPU] Add GFX11 test coverage
Add GFX11 test coverage to a bunch of tests where it was easy to do so, mostly because the checks are autogenerated and/or GFX11 can share the same checks as GFX10.
Differential Revision: https://reviews.llvm.org/D129295
|
|
Revision tags: llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1 |
|
| #
f6462a26 |
| 11-Apr-2022 |
Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com> |
[AMDGPU] Split unaligned 4 DWORD DS operations
Similarly to 3 DWORD operations, it is better for performance to split unaligned operations as long as these are at least DWORD aligned. Performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

                   align 16  align 1  align 2  align 4  align 8
ds_write_b128       4.9 sec  8.1 sec  8.1 sec  5.6 sec  5.6 sec
ds_write2_b64       5.1 sec  8.7 sec  8.7 sec  8.7 sec  5.1 sec
ds_write2_b32 * 2   5.5 sec 14.0 sec 14.0 sec  5.6 sec  5.6 sec
ds_read_b128        3.8 sec  4.6 sec  4.6 sec  4.6 sec  4.6 sec
ds_read2_b64        3.8 sec  8.1 sec  8.1 sec  8.1 sec  3.8 sec
ds_read2_b32 * 2    4.0 sec 14.0 sec 14.0 sec  4.0 sec  4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

                   align 16  align 1  align 2  align 4  align 8
ds_write_b128       6.2 sec 24.1 sec 24.1 sec 14.4 sec 14.4 sec
ds_write2_b64       7.1 sec 25.2 sec 25.1 sec 25.1 sec  7.1 sec
ds_write2_b32 * 2   7.6 sec 43.7 sec 43.7 sec  7.6 sec  7.6 sec
ds_read_b128        6.2 sec 12.5 sec 12.5 sec 12.5 sec 12.5 sec
ds_read2_b64        6.3 sec 24.0 sec 24.0 sec 24.0 sec  6.3 sec
ds_read2_b32 * 2    7.5 sec 43.6 sec 43.6 sec  7.5 sec  7.5 sec
```
Differential Revision: https://reviews.llvm.org/D123634
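One illustrative reading of the numbers above, as a self-contained sketch; the cutoffs are inferred from this table, not taken from the actual lowering code:
```
#include <cstdio>

// Illustrative model of the b128 DS lowering choice the timings suggest:
// a full b128 access only when 16-byte aligned, a b64 pair at 8-byte
// alignment, two b32 pairs at DWORD alignment, and the generic unaligned
// path below that. Not the actual SITargetLowering code.
static const char *chooseDS128Lowering(unsigned alignBytes) {
  if (alignBytes >= 16) return "ds_write_b128";
  if (alignBytes >= 8)  return "ds_write2_b64";
  if (alignBytes >= 4)  return "ds_write2_b32 * 2";
  return "generic unaligned lowering";
}

int main() {
  for (unsigned a : {1u, 2u, 4u, 8u, 16u})
    printf("align %2u -> %s\n", a, chooseDS128Lowering(a));
}
```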
|
| #
16cf9e6d |
| 07-Apr-2022 |
Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com> |
[AMDGPU] Fix handling of gfx10 LDS misaligned access bug
The bug was initially handled only for FLAT because we did not have lowering for unaligned DS instructions. That lowering is now implemented, but the bug was not handled there.
Differential Revision: https://reviews.llvm.org/D123338
|
|
Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3 |
|
| #
04fff547 |
| 07-Mar-2022 |
Venkata Ramanaiah Nalamothu <VenkataRamanaiah.Nalamothu@amd.com> |
[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call-clobbered register range, are added as live-ins on function entry to preserve their value across calls, so that they get saved and restored around the calls.
But DWARF unwind information (CFI) needs to track where the return address resides in a frame, and the above approach makes that difficult when the CFI is emitted during frame lowering, since it requires understanding the control flow.
This patch moves the return address ABI registers s[30:31] into the callee-saved register range and stops adding them as live-ins, so that the CFI machinery knows where the return address resides when the CSR save/restore happens during frame lowering.
Doing the above poses an issue: the return instruction now uses the undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use behind the `SI_RETURN` pseudo instruction, which takes no input operands, until `SI_RETURN` is lowered to `S_SETPC_B64_return` during `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm, ronlieb
Differential Revision: https://reviews.llvm.org/D114652
|
|
Revision tags: llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3 |
|
| #
d8b69040 |
| 20-Jan-2022 |
Abinav Puthan Purayil <abinav.puthanpurayil@amd.com> |
[AMDGPU] Set MemoryVT for truncstores in tblgen.
GlobalISelEmitter was skipping these patterns when its predicates were checked. This patch should allow us to select d16_hi stores in GlobalISel.
Differential Revision: https://reviews.llvm.org/D117762
|
|
Revision tags: llvmorg-13.0.1-rc2 |
|
| #
09b53296 |
| 22-Dec-2021 |
Ron Lieberman <Ron.Lieberman@amd.com> |
Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"
This reverts commit 9075009d1fd5f2bf9aa6c2f362d2993691a316b3.
Failed amdgpu runtime buildbot # 3514
|
| #
9075009d |
| 22-Dec-2021 |
RamNalamothu <VenkataRamanaiah.Nalamothu@amd.com> |
[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call-clobbered register range, are added as live-ins on function entry to preserve their value across calls, so that they get saved and restored around the calls.
But DWARF unwind information (CFI) needs to track where the return address resides in a frame, and the above approach makes that difficult when the CFI is emitted during frame lowering, since it requires understanding the control flow.
This patch moves the return address ABI registers s[30:31] into the callee-saved register range and stops adding them as live-ins, so that the CFI machinery knows where the return address resides when the CSR save/restore happens during frame lowering.
Doing the above poses an issue: the return instruction now uses the undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use behind the `SI_RETURN` pseudo instruction, which takes no input operands, until `SI_RETURN` is lowered to `S_SETPC_B64_return` during `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D114652
|
|
Revision tags: llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init |
|
| #
c2229724 |
| 26-Jul-2021 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting
These actions should only be used for adjusting the register types (and the memory type as needed to satisfy the register type). Unaligned accesses should be split as a type of lowering.
This has the effect of improving the code in many cases since now we produce zextloads instead of separate loads with ands. The load/store legality rules still seem far more complicated than necessary though.
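A self-contained model of splitting an under-aligned access into naturally aligned power-of-two pieces, the kind of lowering-style split described above (illustrative only; the real legalizer operates on LLTs and machine memory operands):
```
#include <cstdio>

// Toy model of lowering-style splitting: break a wide access into
// power-of-two pieces no larger than the known alignment, so every piece
// is naturally aligned. Not the actual LegalizerHelper code.
static unsigned largestPow2AtMost(unsigned x) {
  unsigned p = 1;
  while (p * 2 <= x)
    p *= 2;
  return p;
}

static void splitUnaligned(unsigned sizeBytes, unsigned alignBytes) {
  printf("%u-byte access, align %u:\n", sizeBytes, alignBytes);
  for (unsigned offset = 0; offset < sizeBytes;) {
    unsigned remaining = sizeBytes - offset;
    unsigned piece =
        largestPow2AtMost(remaining < alignBytes ? remaining : alignBytes);
    printf("  %u bytes at offset %u\n", piece, offset);
    offset += piece;
  }
}

int main() {
  splitUnaligned(16, 4); // e.g. a 128-bit load known only to be DWORD aligned
}
```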
|
| #
da067ed5 |
| 10-Nov-2021 |
Austin Kerbow <Austin.Kerbow@amd.com> |
[AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better ILP, since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking, so that stalls are guaranteed when waiting for data to be retrieved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL' heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
|
| #
d2e66d7f |
| 06-Sep-2021 |
Konstantin Schwarz <konstantin.schwarz@hightec-rt.com> |
[GlobalISel] Add a combine for and(load , mask) -> zextload
This only handles simple masks, not shifted masks, for now.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D109357
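A self-contained sketch of the "simple mask" test behind such a combine; the set of legal narrow memory sizes is an assumption for illustration:
```
#include <cstdint>
#include <cstdio>

// Sketch of the mask check behind and(load, mask) -> zextload: the combine
// fires only for masks of the form 2^N - 1 (all-ones in the low bits) where
// N matches a narrower load width. Shifted masks such as 0xff00 are not
// handled. The widths below are assumed for illustration.
static bool isZextloadMask(uint64_t mask, unsigned &memBits) {
  if (mask == 0 || (mask & (mask + 1)) != 0) // not of the form 2^N - 1
    return false;
  unsigned n = 0;
  for (uint64_t m = mask; m != 0; m >>= 1)
    ++n;
  if (n != 8 && n != 16 && n != 32) // assumed legal narrow memory sizes
    return false;
  memBits = n;
  return true;
}

int main() {
  unsigned bits = 0;
  printf("0xff   -> %d (memBits=%u)\n", isZextloadMask(0xff, bits), bits);
  printf("0xff00 -> %d (shifted mask, rejected)\n", isZextloadMask(0xff00, bits));
}
```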
|
| #
3ce1b963 |
| 08-Sep-2021 |
Joe Nash <Joseph.Nash@amd.com> |
[AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
|
|
Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2 |
|
| #
31a9659d |
| 07-Jun-2021 |
Matt Arsenault <Matthew.Arsenault@amd.com> |
GlobalISel: Avoid use of G_INSERT in insertParts
G_INSERT legalization is incomplete and doesn't work very well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES padding with undef values (although this can get pretty large).
For the case of load/store narrowing, this is still performing the load/stores in irregularly sized pieces. It might be cleaner to split this down into equal sized pieces, and rely on load/store merging to optimize it.
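A toy model of the padding arithmetic described above, self-contained and illustrative only (the real code builds G_MERGE_VALUES/G_UNMERGE_VALUES chains on LLTs):
```
#include <cstdio>

// Model of the padding strategy: to handle an irregular-sized value with
// regular-sized pieces, pad it with undef up to the next multiple of the
// piece size, then treat it as a plain sequence of merge/unmerge pieces.
static void padToPieces(unsigned valueBits, unsigned pieceBits) {
  unsigned padded = ((valueBits + pieceBits - 1) / pieceBits) * pieceBits;
  printf("%u-bit value -> %u bits padded: %u pieces of %u bits, %u undef bits\n",
         valueBits, padded, padded / pieceBits, pieceBits, padded - valueBits);
}

int main() {
  padToPieces(96, 64); // 96 -> 128: two 64-bit pieces, 32 undef bits
}
```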
|
|
Revision tags: llvmorg-12.0.1-rc1 |
|
| #
f6985a19 |
| 10-May-2021 |
Petar Avramovic <Petar.Avramovic@amd.com> |
AMDGPU/GlobalISel: Use destination register bank in applyMappingLoad
Large loads on targets that do not useFlatForGlobal have to be split in regbankselect. This did not happen when the destination had a vgpr bank and the address had an sgpr bank. Instead of checking whether the address bank is sgpr, check the bank of the destination.
Differential Revision: https://reviews.llvm.org/D101992
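A self-contained model of the fixed decision (illustrative; the real check lives in AMDGPURegisterBankInfo::applyMappingLoad, and the 128-bit non-SMRD load limit is an assumption here):
```
#include <cstdio>
#include <cstring>

// Model of the corrected split test: key on the *destination* bank, so an
// sgpr-addressed load whose result lands in vgprs is still split. The
// 128-bit limit for non-scalar-memory loads is an illustrative assumption.
static bool needsSplit(const char *dstBank, unsigned loadBits) {
  const unsigned maxNonSmrdLoadBits = 128; // assumed for illustration
  return strcmp(dstBank, "vgpr") == 0 && loadBits > maxNonSmrdLoadBits;
}

int main() {
  // The broken case: sgpr address but vgpr destination. Checking the
  // address bank said "no split"; checking the destination bank splits.
  printf("vgpr dst, 256-bit load: split=%d\n", needsSplit("vgpr", 256));
  printf("sgpr dst, 256-bit load: split=%d\n", needsSplit("sgpr", 256));
}
```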
|
| #
caf1294d |
| 26-Apr-2021 |
Baptiste Saleil <baptiste.saleil@amd.com> |
[AMDGPU] Experiments show that the GCNRegBankReassign pass significantly impacts the compilation time and there is no case for which we see any improvement in performance. This patch removes this pass and its associated test cases from the tree.
Differential Revision: https://reviews.llvm.org/D101313
Change-Id: I0599169a7609c19a887f8d847a71e664030cc141
|
| #
ac64995c |
| 08-Apr-2021 |
hsmahesha <mahesha.comp@gmail.com> |
[AMDGPU] Only use ds_read/write_b128 for alignment >= 16
PS: Submitting on behalf of Jay.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100008
|
|
Revision tags: llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4 |
|
| #
b082e6f8 |
| 29-Mar-2021 |
Petar Avramovic <Petar.Avramovic@amd.com> |
[AMDGPU] Extend gfx10 test coverage. NFC.
Differential Revision: https://reviews.llvm.org/D99267
|
|
Revision tags: llvmorg-12.0.0-rc3, llvmorg-12.0.0-rc2, llvmorg-11.1.0, llvmorg-11.1.0-rc3, llvmorg-12.0.0-rc1, llvmorg-13-init, llvmorg-11.1.0-rc2, llvmorg-11.1.0-rc1 |
|
| #
a9543469 |
| 05-Jan-2021 |
Mircea Trofin <mtrofin@google.com> |
[NFC] Removed unused prefixes in CodeGen/AMDGPU/GlobalISel
Differential Revision: https://reviews.llvm.org/D94099
|
|
Revision tags: llvmorg-11.0.1, llvmorg-11.0.1-rc2, llvmorg-11.0.1-rc1, llvmorg-11.0.0, llvmorg-11.0.0-rc6, llvmorg-11.0.0-rc5, llvmorg-11.0.0-rc4 |
|
| #
a343b9b0 |
| 23-Sep-2020 |
Sebastian Neubauer <sebastian.neubauer@amd.com> |
Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer, > This completely broke the Mesa radeonsi drive
Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer:
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg + xterm come up with major corruption & psychedelic colours.
|
|
Revision tags: llvmorg-11.0.0-rc3 |
|
| #
ca907bfb |
| 04-Sep-2020 |
Sebastian Neubauer <sebastian.neubauer@amd.com> |
[AMDGPU] Insert waitcnt after returning from call
When memory operations are outstanding on function calls, either the caller or the callee can insert a waitcnt to ensure that all reads are finished. Calls need some time to be executed, so if the callee inserts the waitcnt, filling the instruction buffer and waiting for memory will be interleaved, hiding some latency. This comes at the cost of having a waitcnt inside functions that may not be needed when no memory operations are outstanding.
For function calls, this is already implemented. The same principle applies to returns: if the caller inserts a waitcnt after the call, the callee does not have to wait, and the return and memory operation can run in parallel.
This commit implements waiting in the caller after returning from a function call.
Differential Revision: https://reviews.llvm.org/D87674
|
|
Revision tags: llvmorg-11.0.0-rc2, llvmorg-11.0.0-rc1, llvmorg-12-init, llvmorg-10.0.1, llvmorg-10.0.1-rc4, llvmorg-10.0.1-rc3, llvmorg-10.0.1-rc2 |
|
| #
c799f873 |
| 03-Jun-2020 |
Jay Foad <jay.foad@amd.com> |
[AMDGPU] Don't cluster stores
Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets.
The disadvantage is that it tends to increase register pressure and restrict scheduling freedom.
Differential Revision: https://reviews.llvm.org/D85530
|