|
Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1 |
|
| #
2a859b20 |
| 28-Jul-2023 |
David Green <david.green@arm.com> |
[AArch64] Change the cost of vector insert/extract to 2
The cost of vector instructions has always been high under AArch64, in order to add a high cost for inserts/extracts, shuffles and scalarization. This is a conservative approach to limit the scope of unusual SLP vectorization where the codegen ends up being quite poor, but has always been higher than the correct costs would be for any specific core.
This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is also overridden for integer vectors at the same time, to remove the effect of lane 0 being considered free for integer vectors (something that should only be true for float when scalarizing).
The lower insert/extract cost will reduce the cost of inserts, extracts, shuffles and scalarization. The adjustment of ScalarizationOverhead will increase the cost for integers, especially for small vectors. The end result will be a lower cost for float and long-integer types, and a somewhat higher cost for some smaller integer vectors. This, along with the raw insert/extract cost being lower, will generally mean more vectorization from the Loop and SLP vectorizers.
We may end up regretting this, as that vectorization is not always profitable. In all the benchmarking I have done, though, this is generally an improvement in overall performance, and I've attempted to address the places where it wasn't with other cost-model adjustments.
Differential Revision: https://reviews.llvm.org/D155459
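As an illustration (hypothetical test, not taken from the patch), the change can be observed with the cost-model printer; the CHECK line reflects the new cost of 2:
---
; RUN: opt < %s -mtriple=aarch64--linux-gnu -passes="print<cost-model>" -disable-output 2>&1 | FileCheck %s
define i32 @extract_lane1(<4 x i32> %v) {
; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %e = extractelement
  %e = extractelement <4 x i32> %v, i32 1
  ret i32 %e
}
---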
|
|
Revision tags: llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0 |
|
| #
242203d2 |
| 30-Aug-2022 |
Mingming Liu <mingmingl@google.com> |
[AArch64][TTI] Add cost table entry for trunc over vector of integers.
1) Tablegen patterns exist to use 'xtn' and 'uzp1' for trunc [1]. Cost table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
2) Without this, an IR instruction like 'trunc <8 x i16> %v to <8 x i8>' is considered free and might be sunk into other basic blocks. As a result, the sunk 'trunc' ends up in a different basic block from its (usually not-free) vector operand and misses the chance to be combined during instruction selection (examples in [2]).
3) It's a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction computing the operand of trunc could be faster (e.g., in throughput) than the instruction corresponding to "trunc (bin-vector-op)". For instance in [3], sinking %1 (the trunc operand) into bb.1 and bb.2 means replacing 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilizes the V1 pipeline), which is not necessarily good, especially since the ushr result needs to be preserved for the store operation in bb.0. Meanwhile, it's too optimistic (for the CodeGenPrepare pass) to assume machine-cse will always be able to de-dup the shrn from the various basic blocks into one shrn.
[1] For {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32}: https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L4472. For concat (trunc, trunc) -> uzp1: https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L5428-L5437
[2] Examples:
- trunc(umin(X, 255)) -> UQXTN v8i8 (and other {u,s}x{min,max} patterns for v8i16 operands), from https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L4515-L4528
- trunc (AArch64vlshr v8i16, imm) -> SHRNv8i8 (same missed for SHRNv2i32), from https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L6743-L6748
[3]
---
; instruction latency / throughput / pipeline on `neoverse-n1`
bb.0:
  %1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4> ; ushr, latency 2, throughput 1, pipeline V1
  %2 = trunc <8 x i16> %1 to <8 x i8>                                               ; xtn, latency 2, throughput 2, pipeline V
  store <8 x i16> %1, ptr %addr
  br i1 %cond, label %bb.1, label %bb.2

bb.1:
  %4 = trunc <8 x i16> %1 to <8 x i8> ; xtn

bb.2:
  %5 = trunc <8 x i16> %1 to <8 x i8> ; xtn
---
Differential Revision: https://reviews.llvm.org/D132784
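A sketch of how the new table entries surface in a cost-model test (hypothetical function, not from the patch; v8i16->v8i8 is a single xtn, so the expected cost is 1 rather than the previous 0):
---
; RUN: opt < %s -mtriple=aarch64--linux-gnu -passes="print<cost-model>" -disable-output 2>&1 | FileCheck %s
define <8 x i8> @trunc_v8i16(<8 x i16> %v) {
; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %t = trunc
  %t = trunc <8 x i16> %v to <8 x i8>
  ret <8 x i8> %t
}
---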
|
| #
3785234b |
| 30-Aug-2022 |
Mingming Liu <mingmingl@google.com> |
[NFC][AArch64] Specify datalayout explicitly for cast.ll and arith-overflow.ll and update tests accordingly.
- These two tests stand out when the data layout is explicitly added in a sweep study (D132889).
Differential Revision: https://reviews.llvm.org/D132856
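Concretely, "specify datalayout explicitly" means adding lines like the following to the tests (the layout string below is an assumption based on the usual aarch64-linux layout; the patch uses whatever the backend reports). With the layout present, the cost model no longer runs on default assumptions, so the CHECK lines need regeneration, hence "update tests accordingly":
---
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"
---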
|
|
Revision tags: llvmorg-15.0.0-rc3 |
|
| #
4178e334 |
| 10-Aug-2022 |
Simon Pilgrim <llvm-dev@redking.me.uk> |
[CostModel] Update RUN -passes=* to double quotes to appease update scripts on windows
DOS really doesn't like single ('') quotes being used in command lines.
Some prep work, as I'm intending to resurrect D79483 soon.
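An illustrative before/after for the quoting change (representative RUN lines, not the exact diff; cmd.exe does not treat single quotes as quoting characters, so the angle brackets would otherwise be parsed as redirections):
---
; Before:
; RUN: opt < %s -passes='print<cost-model>' ... | FileCheck %s
; After:
; RUN: opt < %s -passes="print<cost-model>" ... | FileCheck %s
---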
|
|
Revision tags: llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1 |
|
| #
750bf358 |
| 04-Apr-2022 |
David Green <david.green@arm.com> |
[AArch64] Increase cost of v2i64 multiplies
The cost of a v2i64 multiply was special-cased in D92208 as scalarized into 4*extract + 2*insert + 2*mul. Scalarizing to/from GPR registers is expensive though, and the cost wasn't high enough to prevent vectorizing in places where it can be detrimental to performance. This increases it so that the cost of copying to/from GPRs is 2 each, with the total cost increasing to 14. So long as umull/smull are handled correctly (as in D123006) this seems to lead to better vectorization factors and better performance.
Differential Revision: https://reviews.llvm.org/D123007
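Spelling out the arithmetic implied above (assuming each scalar mul itself costs 1):
---
cost(v2i64 mul) = 4*extract + 2*insert + 2*mul
                = 6 copies * 2 + 2 * 1
                = 14
---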
|
|
Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3 |
|
| #
65c0e45a |
| 03-Mar-2022 |
David Green <david.green@arm.com> |
[AArch64] Vector shifts cost 1
The cost of vector shifts was 2 rather than 1, as the nodes are marked custom. Fix this like the others and mark the nodes as cheap.
Differential Revision: https://reviews.llvm.org/D120773
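For illustration (hypothetical example, not from the patch), a variable vector shift like the one below maps to a single ushl instruction on AArch64 and is now costed at 1 instead of 2:
---
define <4 x i32> @shl_v4i32(<4 x i32> %a, <4 x i32> %b) {
  %s = shl <4 x i32> %a, %b   ; single instruction, cost 1 after this change
  ret <4 x i32> %s
}
---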
|
|
Revision tags: llvmorg-14.0.0-rc2 |
|
| #
15ba588d |
| 09-Feb-2022 |
Arthur Eubanks <aeubanks@google.com> |
[test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>'
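The shape of that migration, sketched on a representative RUN line (the single quotes here were later switched to double quotes, per the commit above):
---
; Before (legacy pass manager):
; RUN: opt < %s -analyze -cost-model -mtriple=aarch64--linux-gnu | FileCheck %s
; After (new pass manager):
; RUN: opt < %s -passes='print<cost-model>' -disable-output -mtriple=aarch64--linux-gnu 2>&1 | FileCheck %s
---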
|
|
Revision tags: llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2 |
|
| #
bc615e43 |
| 07-Jan-2022 |
David Green <david.green@arm.com> |
[AArch64] Update addo and subo costs
Similar to D116732, this adds basic scalar sadd_with_overflow, uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for AArch64, which are usually quite efficiently lowered.
Differential Revision: https://reviews.llvm.org/D116734
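A hypothetical snippet of the kind these costs apply to (not from the patch); on AArch64 this lowers to roughly adds + cset, hence the low scalar cost:
---
declare { i64, i1 } @llvm.sadd.with.overflow.i64(i64, i64)

define { i64, i1 } @saddo_i64(i64 %a, i64 %b) {
  ; adds sets the flags; cset materializes the overflow bit
  %r = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %a, i64 %b)
  ret { i64, i1 } %r
}
---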
|
| #
c65270cf |
| 06-Jan-2022 |
David Green <david.green@arm.com> |
[AArch64] Add basic umulo and smulo costs
This adds some AArch64 specific smul_with_overflow and umul_with_overflow costs, overriding the default costs. The code generation for these mul with overflow intrinsics is usually better than the default expansion on AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll.
Differential Revision: https://reviews.llvm.org/D116732
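For illustration (hypothetical snippet, not from the patch), 64-bit umulo is typically lowered as mul + umulh + compare on AArch64, cheaper than the generic expansion the default cost assumes:
---
declare { i64, i1 } @llvm.umul.with.overflow.i64(i64, i64)

define { i64, i1 } @umulo_i64(i64 %a, i64 %b) {
  ; mul gives the low half, umulh the high half; overflow iff the high half is non-zero
  %r = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %a, i64 %b)
  ret { i64, i1 } %r
}
---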
|
|
Revision tags: llvmorg-13.0.1-rc1 |
|
| #
74b2a4ed |
| 26-Oct-2021 |
David Green <david.green@arm.com> |
[AArch64] Add a costmodel test for overflowing arithmetic. NFC
|