Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2

# f71ad19c | 30-Sep-2023 | David Green <david.green@arm.com>
[AArch64] Add a target feature for AArch64StorePairSuppress
The AArch64StorePairSuppress pass prevents the creation of STP under some heuristics. Unfortunately it often prevents STP in cases where it is obviously beneficial, and suppressing STP does not match my understanding of scheduling/cpu pipelining. From some benchmarking, even on an in-order cpu where the scheduling is most important, I don't see it giving better results. In general the lower instruction count of STP would be expected to give a slightly better cycle count.
As the pass specifically mentions the cyclone cpu, this patch adds a target feature for FeatureStorePairSuppress, enabled for all the non-Arm cpus. This has the effect of disabling it for all Arm cpus.
Differential Revision: https://reviews.llvm.org/D134646
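For context, a minimal sketch (hypothetical C, not from the patch) of the kind of adjacent stores that store-pair formation merges once the suppression is disabled:

```c
#include <stdint.h>

/* Two adjacent 64-bit stores: with store-pair formation allowed, the
   AArch64 backend can emit a single "stp x1, x2, [x0]" here instead of
   two separate "str" instructions, lowering the instruction count. */
void store_adjacent(uint64_t *p, uint64_t a, uint64_t b) {
    p[0] = a;
    p[1] = b;
}
```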
Revision tags: llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1

# db158c7c | 28-Jul-2023 | Harvin Iriawan <harvin.iriawan@arm.com>
[AArch64] Update generic sched model to A510
Refresh of the generic scheduling model to use the A510 instead of the A55. The main benefits are to the little core, along with the introduction of SVE scheduling information. The changes were tested on various OoO cores; no performance degradation was seen.
Differential Revision: https://reviews.llvm.org/D156799
Revision tags: llvmorg-18-init

# 6e54fcce | 01-Jul-2023 | Igor Kudrin <ikudrin@accesssoftek.com>
[AArch64] Emit fewer CFI instructions for synchronous unwind tables
The instruction-precise, or asynchronous, unwind tables usually take up much more space than the synchronous ones. If a user is concerned about the load size of the program and does not need the features provided with the asynchronous tables, the compiler should be able to generate the more compact variant.
This patch changes the generation of CFI instructions for these cases so that they all come in one chunk in the prologue; it emits only one `.cfi_def_cfa*` instruction followed by `.cfi_offset` ones after all stack adjustments and register spills, and avoids generating CFI instructions in the epilogue(s) as well as any other superfluous CFI instructions such as `.cfi_remember_state` and `.cfi_restore_state`. Effectively, it reverses the effects of D111411 and D114545 on functions with the `uwtable(sync)` attribute. As a side effect, it also restores the previous behavior on functions that have neither the `uwtable` nor the `nounwind` attribute.
Differential Revision: https://reviews.llvm.org/D153098
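As a rough illustration (a sketch, assuming the standard clang flags rather than anything introduced by this patch), the two table flavours are usually selected from the frontend:

```c
/* Sketch of the trade-off between the two unwind-table flavours:
 *
 *   clang -funwind-tables              -> uwtable(sync) in IR: after
 *       this patch the prologue carries one compact chunk of CFI
 *       (one .cfi_def_cfa* followed by .cfi_offset directives) and
 *       the epilogues carry none.
 *   clang -fasynchronous-unwind-tables -> uwtable(async): CFI stays
 *       instruction-precise, at a noticeable size cost.
 *
 * Unwinding from a point inside f's prologue (e.g. a profiling
 * signal) is only reliable with the asynchronous variant. */
void f(void) {
    volatile int frame[64]; /* forces a stack adjustment in the prologue */
    frame[0] = 0;
}
```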
Revision tags: llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4

# fc26ab36 | 24-Feb-2023 | Chen Zheng <czhengsz@cn.ibm.com>
[DAGCombiner] don't use the pointer info for widen store
The merged store touches memory for other underlying objects, so mapping the merged store to the first underlying object is not correct. For example, in https://github.com/llvm/llvm-project/issues/60744 the merged store is not correctly analyzed as dependent on memory operations that are also part of the merged store.
Fixes #60744
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D144711
Revision tags: llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7

# 5ddce70e | 19-Dec-2022 | Nikita Popov <npopov@redhat.com>
[AArch64] Convert some tests to opaque pointers (NFC)
Revision tags: llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4

# 572fc7d2 | 23-May-2022 | Andre Vieira <andre.simoesdiasvieira@arm.com>
[AArch64] Order STP Q's by ascending address
This patch adds an AArch64-specific post-RA MachineScheduler to try to schedule STP Q's to the same base address in ascending order of offsets. We have found this to improve performance on Neoverse N1, and it should not hurt other AArch64 cores.
Differential Revision: https://reviews.llvm.org/D125377
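A minimal sketch (hypothetical C, not from the patch) of code that produces several q-register store-pairs to one base address, which the new post-RA scheduler orders by ascending offset:

```c
#include <arm_neon.h>

/* Four 128-bit stores to consecutive offsets from the same base. With
   this scheduler the resulting pairs are emitted in ascending order,
   e.g. "stp q0, q1, [x0]" before "stp q2, q3, [x0, #32]". */
void store_quads(float32x4_t *p, float32x4_t a, float32x4_t b,
                 float32x4_t c, float32x4_t d) {
    p[0] = a;
    p[1] = b;
    p[2] = c;
    p[3] = d;
}
```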
Revision tags: llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1

# 626039cd | 11-Apr-2022 | Alexander Shaposhnikov <ashaposhnikov@google.com>
[AArch64] Split fuse-literals feature
This diff splits the fuse-literals feature and enables fuse-adrp-add by default; in particular, it adjusts instruction scheduling to place ADRP+ADD pairs together. This also enables the linker to apply the relaxations described in https://github.com/ARM-software/abi-aa/commit/d2ca58c54b8e955cfef25c71822f837ae0439d73.
Differential revision: https://reviews.llvm.org/D120104
Test plan: make check-all
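To see what is being fused, consider this minimal sketch (hypothetical C, not from the patch): materializing a global's address on AArch64 lowers to an ADRP+ADD pair, and keeping the two adjacent is what enables the linker relaxation mentioned above:

```c
/* Taking the address of a global typically lowers to:
 *     adrp x0, g            // page address of g
 *     add  x0, x0, :lo12:g  // low 12 bits of g's address
 * When the pair stays adjacent, the linker can relax it, e.g. to a
 * single "adr" when the symbol lands close enough. */
long g;
long *address_of_g(void) { return &g; }
```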
# 50a97aac | 24-Mar-2022 | Momchil Velikov <momchil.velikov@arm.com>
[AArch64] Async unwind - function prologues
Re-commit of 32e8b550e5439c7e4aafa73894faffd5f25d0d05
This patch rearranges emission of CFI instructions, so the resulting DWARF and `.eh_frame` information is precise at every instruction.
The current state is that the unwind info is emitted only after the function prologue. This is fine for synchronous (e.g. C++) exceptions, but the information is generally incorrect when the program counter is at an instruction in the prologue or the epilogue, for example:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
.cfi_def_cfa w29, 16
...
```
After the `stp` is executed, the (initial) rule for the CFA still says the CFA is in `sp`, even though `sp` is already offset by 16 bytes.
A correct unwind info could look like:

```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
...
```
Having this information precise up to an instruction is useful for sampling profilers that would like to get a stack backtrace. The end goal (towards which this patch is just a step) is to have fully working `-fasynchronous-unwind-tables`.
Reviewed By: danielkiss, MaskRay
Differential Revision: https://reviews.llvm.org/D111411
Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3

# 85c53c70 | 04-Mar-2022 | Hans Wennborg <hans@chromium.org>
Revert "[AArch64] Async unwind - function prologues"
It caused builds to assert with:
(StackSize == 0 && "We already have the CFA offset!"), function generateCompactUnwindEncoding, file AArch64AsmBackend.cpp, line 624.
when targeting iOS. See the comment on the code review for a reproducer.
> This patch rearranges emission of CFI instructions, so the resulting
> DWARF and `.eh_frame` information is precise at every instruction.
>
> The current state is that the unwind info is emitted only after the
> function prologue. This is fine for synchronous (e.g. C++) exceptions,
> but the information is generally incorrect when the program counter is
> at an instruction in the prologue or the epilogue, for example:
>
> ```
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> mov x29, sp
> .cfi_def_cfa w29, 16
> ...
> ```
>
> after the `stp` is executed the (initial) rule for the CFA still says
> the CFA is in the `sp`, even though it's already offset by 16 bytes
>
> A correct unwind info could look like:
> ```
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> .cfi_def_cfa_offset 16
> mov x29, sp
> .cfi_def_cfa w29, 16
> ...
> ```
>
> Having this information precise up to an instruction is useful for
> sampling profilers that would like to get a stack backtrace. The end
> goal (towards this patch is just a step) is to have fully working
> `-fasynchronous-unwind-tables`.
>
> Reviewed By: danielkiss, MaskRay
>
> Differential Revision: https://reviews.llvm.org/D111411
This reverts commit 32e8b550e5439c7e4aafa73894faffd5f25d0d05.
Revision tags: llvmorg-14.0.0-rc2

# 32e8b550 | 28-Feb-2022 | Momchil Velikov <momchil.velikov@arm.com>
[AArch64] Async unwind - function prologues
This patch rearranges emission of CFI instructions, so the resulting DWARF and `.eh_frame` information is precise at every instruction.
The current state is that the unwind info is emitted only after the function prologue. This is fine for synchronous (e.g. C++) exceptions, but the information is generally incorrect when the program counter is at an instruction in the prologue or the epilogue, for example:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
.cfi_def_cfa w29, 16
...
```
After the `stp` is executed, the (initial) rule for the CFA still says the CFA is in `sp`, even though `sp` is already offset by 16 bytes.
A correct unwind info could look like:

```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
...
```
Having this information precise up to an instruction is useful for sampling profilers that would like to get a stack backtrace. The end goal (towards which this patch is just a step) is to have fully working `-fasynchronous-unwind-tables`.
Reviewed By: danielkiss, MaskRay
Differential Revision: https://reviews.llvm.org/D111411
Revision tags: llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1

# adec9223 | 09-Oct-2021 | David Green <david.green@arm.com>
[AArch64] Make -mcpu=generic schedule for an in-order core
We would like to start pushing -mcpu=generic towards enabling the set of features that improves performance for some CPUs, without hurting any others. A blend of the performance options hopefully beneficial to all CPUs. The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change from eecb353d0e25ba which made -mcpu=generic perform in-order scheduling using the cortex-a8 schedule model.
The idea is that in-order cpus require the most help in instruction scheduling, whereas out-of-order cpus can for the most part schedule out-of-order around different codegen. Our benchmarking suggests that hypothesis holds. When running on an in-order core this improved performance by 3.8% geomean on a set of DSP workloads, 2% geomean on some other embedded benchmarks and between 1% and 1.8% on a set of singlecore and multicore workloads, all running on a Cortex-A55 cluster.
On an out-of-order cpu the results are a lot more noisy but show flat performance or an improvement. On the set of DSP and embedded benchmarks, run on a Cortex-A78 there was a very noisy 1% speed improvement. Using the most detailed results I could find, SPEC2006 runs on a Neoverse N1 show a small increase in instruction count (+0.127%), but a decrease in cycle counts (-0.155%, on average). The instruction count is very low noise, the cycle count is more noisy with a 0.15% decrease not being significant. SPEC2k17 shows a small decrease (-0.2%) in instruction count leading to a -0.296% decrease in cycle count. These results are within noise margins but tend to show a small improvement in general.
When specifying an Apple target, clang will set "-target-cpu apple-a7" on the command line, so it should not be affected by this change when running from clang. This also doesn't enable more runtime unrolling like -mcpu=cortex-a55 does; it only changes the schedule used.
A lot of existing tests have been updated. This is a summary of the important differences:
- Most changes are the same instructions in a different order.
- Sometimes this leads to very minor inefficiencies, such as requiring an extra mov to move variables into r0/v0 for the return value of a test function.
- misched-fusion.ll was no longer fusing the pairs of instructions it should, as per D110561. I've changed the schedule used in the test for now.
- neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to the different latencies. This seems fine to me.
- Some SVE tests do not always remove movprfx where they did before due to different register allocation giving different destructive forms.
- The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll produce two LDR where they previously produced an LDP due to store-pair-suppress kicking in.
- arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
- Some tests such as arm64-neon-mul-div.ll and ragreedy-local-interval-cost.ll have more, less or just different spilling.
- In aarch64_generated_funcs.ll.generated.expected one part of the function is no longer outlined. Interestingly, if I switch this to use any other schedule even less is outlined.
Some of these are expected to happen, such as differences in outlining or register spilling. There will be places where these result in worse codegen, places where they are better, with the SPEC instruction counts suggesting it is not a decrease overall, on average.
Differential Revision: https://reviews.llvm.org/D110830
Revision tags: llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3

# 2c5590ad | 10-Sep-2021 | David Green <david.green@arm.com>
[AArch64] Regenerate some test checks. NFC
This updates some test files, mostly ones using update_test_checks, regenerating the check lines with the script and making them more maintainable.
Revision tags: llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2

# e4ecd83f | 09-Jun-2021 | David Spickett <david.spickett@linaro.org>
[llvm][AArch64] Handle arrays of struct properly (from IR)
This only applies to FastIsel. GlobalIsel seems to sidestep the issue.
This fixes https://bugs.llvm.org/show_bug.cgi?id=46996
One of the things we do in llvm is decide if a type needs consecutive registers. Previously, we just checked whether it was an array or not (plus an SVE-specific check that is not changing here).
This causes some confusion when you have arbitrary IR like:

```
%T1 = type { double, i1 }
define [ 1 x %T1 ] @foo() {
entry:
  ret [ 1 x %T1 ] zeroinitializer
}
```
We see it is an array so we call CC_AArch64_Custom_Block which bails out when it sees the i1, a type we don't want to put into a block.
This leaves the location of the double in some kind of intermediate state and leads to odd codegen, which then crashes the backend because it doesn't know how to implement what it's been asked for.
You get this:

```
renamable $d0 = FMOVD0
$w0 = COPY killed renamable $d0
```
Rather than this:

```
$d0 = FMOVD0
$w0 = COPY $wzr
```
The backend knows how to copy 64-bit to 64-bit registers, but not 64 to 32. It can certainly be taught how, but the real issue seems to be that we even try to assign a register block in the first place.
This change makes the logic of AArch64TargetLowering::functionArgumentNeedsConsecutiveRegisters a bit more in depth. If we find an array, also check that all the nested aggregates in that array have a single member type.
Then CC_AArch64_Custom_Block's assumption of a type that looks like [ N x type ] will be valid and we get the expected codegen.
New tests have been added to exercise these situations. Note that some of the output is not ABI compliant. The aim of this change is to simply handle these situations and not to make our processing of arbitrary IR ABI compliant.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D104123
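A rough C analogue of the IR above (hypothetical, for illustration only; the original report arrived as IR rather than C):

```c
/* An array whose element is a struct with two differently-typed fields.
   Because the nested aggregate does not have a single member type, it
   must not be treated as a consecutive-register block; the deeper check
   in functionArgumentNeedsConsecutiveRegisters now rejects it instead
   of leaving lowering in an inconsistent state. */
struct T1 { double d; _Bool b; };
struct Wrapped { struct T1 t[1]; };

struct Wrapped make_wrapped(void) {
    struct Wrapped w = {{{0.0, 0}}};
    return w;
}
```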