History log of /llvm-project/llvm/test/Analysis/CostModel/AArch64/arith-overflow.ll (Results 1 – 10 of 10)
Revision Date Author Comments
Revision tags: llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1
# 2a859b20 28-Jul-2023 David Green <david.green@arm.com>

[AArch64] Change the cost of vector insert/extract to 2

The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.

This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 CPUs. The ScalarizationOverhead is
also overridden for integer vectors at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).

The lower insert/extract cost will reduce the cost of inserts, extracts,
shuffling and scalarization. The adjustment of ScalarizationOverhead will
increase the cost for integer types, especially for small vectors. The end result will
be lower costs for float and long-integer types, and somewhat higher costs for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizers.

We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done, this is generally an improvement in
overall performance, and I've attempted to address the places where it wasn't
with other cost-model adjustments.

Differential Revision: https://reviews.llvm.org/D155459
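
To see how this shows up in a cost-model test like this one, a minimal sketch, with an illustrative triple and function name (not taken from the commit) and the expected cost of 2 coming from the change described above:

---
; RUN: opt < %s -mtriple=aarch64--linux-gnu -passes="print<cost-model>" 2>&1 -disable-output | FileCheck %s

define i32 @extract_lane1(<4 x i32> %v) {
; With the insert/extract cost reduced from 3 to 2, this extract should be
; reported with cost 2.
; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %e = extractelement
  %e = extractelement <4 x i32> %v, i32 1
  ret i32 %e
}
---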



Revision tags: llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0
# 242203d2 30-Aug-2022 Mingming Liu <mingmingl@google.com>

[AArch64][TTI] Add cost table entry for trunc over vector of integers.

1) Tablegen patterns exist to use 'xtn' and 'uzp1' for trunc [1]. Cost table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
2) Without this, an IR instruction like trunc <8 x i16> %v to <8 x i8> is considered free and might be sunk to other basic blocks. As a result, the sunk 'trunc' ends up in a different basic block from its (usually not-free) vector operand and misses the chance to be combined during instruction selection. (examples in [2])
3) It's a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction computing the operand of trunc could be faster (e.g., in throughput) than the instruction corresponding to "trunc (bin-vector-op)". For instance in [3], sinking %1 (as the trunc operand) into bb.1 and bb.2 means replacing 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilizes the V1 pipeline), which is not necessarily good, especially since the ushr result needs to be preserved for the store operation in bb.0. Meanwhile, it's too optimistic (for the CodeGenPrepare pass) to assume machine-cse will always be able to de-dup the shrn from the various basic blocks into one shrn.

[1] For {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32}, https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L4472.
For concat (trunc, trunc) -> uzip1, https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L5428-L5437
[2] examples
- trunc(umin(X, 255)) -> UQXTRN v8i8 (and other {u,s}x{min,max} pattern for v8i16 operands) from https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L4515-L4528
- trunc (AArch64vlshr v8i16, imm) -> SHRNv8i8 (same missed for SHRNv2i32) from https://github.com/llvm/llvm-project/blob/813ae2871d71f32cce46768e63185cd64651f6e9/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L6743-L6748
[3]
---
; instruction latency / throughput / pipeline on `neoverse-n1`
bb.0:
%1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4> ; ushr, latency 2, throughput 1, pipeline V1
%2 = trunc <8 x i16> %1 to <8 x i8> ; xtn, latency 2, throughput 2, pipeline V
store <8 x i8> %2, ptr %addr
br i1 %cond, label %bb.1, label %bb.2

bb.1:
%4 = trunc <8 x i16> %1 to <8 x i8> ; xtn

bb.2:
%5 = trunc <8 x i16> %1 to <8 x i8> ; xtn
---

Differential Revision: https://reviews.llvm.org/D132784



# 3785234b 30-Aug-2022 Mingming Liu <mingmingl@google.com>

[NFC][AArch64] Specify datalayout explicitly for cast.ll and
arith-overflow.ll and update tests accordingly.

- These two tests stand out when the data layout is explicitly added in a
sweep study (D132889).

Differential Revision: https://reviews.llvm.org/D132856
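
As a sketch of what specifying the datalayout explicitly looks like in an .ll test (the exact strings below are illustrative assumptions, not quoted from the commit):

---
; Pin the layout and triple at the top of the test so the cost model does not
; depend on whatever default the driver would otherwise pick. The datalayout
; string is illustrative; the real one comes from the AArch64 backend.
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"
---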



Revision tags: llvmorg-15.0.0-rc3
# 4178e334 10-Aug-2022 Simon Pilgrim <llvm-dev@redking.me.uk>

[CostModel] Update RUN -passes=* to double quotes to appease update scripts on windows

DOS really doesn't like `` quotes to be used in command lines

Some prep work as I'm intending to resurrect D79483 soon



Revision tags: llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1
# 750bf358 04-Apr-2022 David Green <david.green@arm.com>

[AArch64] Increase cost of v2i64 multiplies

The cost of a v2i64 multiply was special cased in D92208 as scalarized
into 4*extract + 2*insert + 2*mul. Scalarizing to/from GPR registers is
expensive though, and the cost wasn't high enough to prevent vectorizing
in places where it can be detrimental to performance. This increases it
so that the cost of copying to/from GPRs rises to 2 each, with
the total cost increasing to 14. So long as umull/smull are handled
correctly (as in D123006) this seems to lead to better vectorization
factors and better performance.

Differential Revision: https://reviews.llvm.org/D123007
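
The total of 14 follows from the scalarization breakdown above, assuming a cost of 1 for each scalar multiply:

---
; v2i64 mul scalarized as 4 extracts + 2 inserts + 2 scalar muls:
;   4 * 2 (vector -> GPR)  +  2 * 2 (GPR -> vector)  +  2 * 1 (mul)
;   = 8 + 4 + 2 = 14
---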



Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3
# 65c0e45a 03-Mar-2022 David Green <david.green@arm.com>

[AArch64] Vector shifts cost 1

The cost of vector shifts was 2 as opposed to 1, as the nodes are
marked custom. Fix this like the others and mark the nodes as cheap.

Differential Revision: https://reviews.llvm.org/D120773



Revision tags: llvmorg-14.0.0-rc2
# 15ba588d 09-Feb-2022 Arthur Eubanks <aeubanks@google.com>

[test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>'
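
In terms of the RUN lines in this test, the migration looks roughly like this (the triple is an illustrative assumption):

---
; Old, legacy-pass-manager invocation:
; RUN: opt < %s -analyze -cost-model -mtriple=aarch64--linux-gnu | FileCheck %s
; New-pass-manager invocation of the same printer:
; RUN: opt < %s -passes='print<cost-model>' -disable-output -mtriple=aarch64--linux-gnu 2>&1 | FileCheck %s
---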


Revision tags: llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2
# bc615e43 07-Jan-2022 David Green <david.green@arm.com>

[AArch64] Update addo and subo costs

Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.

Differential Revision: https://reviews.llvm.org/D116734
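
For reference, the kind of IR these costs apply to; a minimal sketch with illustrative names:

---
declare { i64, i1 } @llvm.sadd.with.overflow.i64(i64, i64)

define { i64, i1 } @sadd_ovf(i64 %a, i64 %b) {
  ; Scalar signed add-with-overflow; this change gives it a basic AArch64
  ; cost reflecting how cheaply it lowers (an adds plus a flag check) rather
  ; than the generic default.
  %r = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %a, i64 %b)
  ret { i64, i1 } %r
}
---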



# c65270cf 06-Jan-2022 David Green <david.green@arm.com>

[AArch64] Add basic umulo and smulo costs

This adds some AArch64-specific smul_with_overflow and umul_with_overflow
costs, overriding the default costs. The code generation for these
mul-with-overflow intrinsics is usually better than the default expansion on
AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various
types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll.

Differential Revision: https://reviews.llvm.org/D116732



Revision tags: llvmorg-13.0.1-rc1
# 74b2a4ed 26-Oct-2021 David Green <david.green@arm.com>

[AArch64] Add a costmodel test for overflowing arithmetic. NFC
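
As a closing sketch of what such a cost-model test contains, with illustrative function names and a regex in place of the exact cost (the real file covers many more types and the {s,u}{add,sub,mul}.with.overflow intrinsics discussed in the commits above):

---
; RUN: opt < %s -analyze -cost-model -mtriple=aarch64--linux-gnu | FileCheck %s

declare { i32, i1 } @llvm.umul.with.overflow.i32(i32, i32)

define i32 @umulo_i32(i32 %a, i32 %b) {
; CHECK: Cost Model: Found an estimated cost of {{[0-9]+}} for instruction: %m = call
  %m = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %a, i32 %b)
  %v = extractvalue { i32, i1 } %m, 0
  ret i32 %v
}
---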