History log of /llvm-project/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp (Results 126 – 150 of 352)
Revision  Date  Author  Comments
# 0a98efb0 15-Feb-2021 David Green <david.green@arm.com>

[ARM] Add some basic Min/Max costs

This adds basic MVE costs for SMIN/SMAX/UMIN/UMAX, as well as MINNUM and
MAXNUM representing fmin and fmax. It tightens up the costs rather than
using an ICmp+Select cost.

Differential Revision: https://reviews.llvm.org/D96603
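
As a sketch of what avoiding an ICmp+Select cost looks like in a TTI
intrinsic-cost switch (isLegalMVEVector is a hypothetical predicate, not
a function from the patch):

  case Intrinsic::smin:
  case Intrinsic::smax:
  case Intrinsic::umin:
  case Intrinsic::umax: {
    // A legal MVE vector min/max maps to a single vmin/vmax, so charge
    // one op (scaled by the MVE factor) instead of a compare plus select.
    std::pair<InstructionCost, MVT> LT =
        getTypeLegalizationCost(DL, ICA.getReturnType());
    if (ST->hasMVEIntegerOps() && isLegalMVEVector(LT.second))
      return LT.first * ST->getMVEVectorCostFactor();
    break;
  }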



# 357237e9 15-Feb-2021 Sjoerd Meijer <sjoerd.meijer@arm.com>

Recommit "[TTI] Unify FavorPostInc and FavorBackedgeIndex into getPreferredAddressingMode"

This reverts commit effc3b079927a6dd3084b4ff712ec07f926366f0, with the build
problem fixed.


# effc3b07 15-Feb-2021 Sjoerd Meijer <sjoerd.meijer@arm.com>

Revert "[TTI] Unify FavorPostInc and FavorBackedgeIndex into getPreferredAddressingMode"

This reverts commit cd6de0e8de4a5fd558580be4b1a07116914fc8ed.


# cd6de0e8 12-Feb-2021 Sjoerd Meijer <sjoerd.meijer@arm.com>

[TTI] Unify FavorPostInc and FavorBackedgeIndex into getPreferredAddressingMode

This refactors shouldFavorPostInc() and shouldFavorBackedgeIndex() into
getPreferredAddressingMode() so that we have one interface to steer LSR in
generating the preferred addressing mode.

Differential Revision: https://reviews.llvm.org/D96600
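
The unified hook has roughly this shape (a sketch based on the message;
see D96600 for the exact definition):

  enum AddressingModeKind {
    AMK_PreIndexed,  // pre-increment addressing is preferred
    AMK_PostIndexed, // post-increment addressing is preferred
    AMK_None         // no preference
  };
  AddressingModeKind getPreferredAddressingMode(const Loop *L,
                                                ScalarEvolution *SE) const;

LSR can then make one query instead of consulting two boolean hooks.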



# 79b1b4a5 12-Feb-2021 Sanjay Patel <spatel@rotateright.com>

[Vectorizers][TTI] remove option to bypass creation of vector reduction intrinsics

The vector reduction intrinsics started life as experimental ops, so backend support
was lacking. As part of promoting them to 1st-class intrinsics, however, codegen
support was added/improved:
D58015
D90247

So I think it is safe to now remove this complication from IR.

Note that we still have an IR-level codegen expansion pass for these as discussed
in D95690. Removing that is another step in simplifying the logic. Also note that
x86 was already unconditionally forming reductions in IR, so there should be no
difference for x86.

I spot checked a couple of the tests here by running them through opt+llc and did
not see any asm diffs.

If we do find functional differences for other targets, it should be possible
to (at least temporarily) restore the shuffle IR with the ExpandReductions IR
pass.

Differential Revision: https://reviews.llvm.org/D96552



# b1ef919a 11-Feb-2021 David Green <david.green@arm.com>

[ARM] Add CostKind to getMVEVectorCostFactor.

This adds the CostKind to getMVEVectorCostFactor, so that it can
automatically account for CodeSize costs, where it returns a cost of 1
not the MVEFactor used for Throughput/Latency. This helps simplify the
caller code and allows us to get the codesize cost more correct in more
cases.
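
A minimal sketch of the described behaviour (assumed shape, not the
exact code from the patch):

  unsigned getMVEVectorCostFactor(TTI::TargetCostKind CostKind) const {
    // CodeSize counts instructions, so one MVE op costs 1; throughput
    // and latency keep the subtarget's MVE factor.
    if (CostKind == TTI::TCK_CodeSize)
      return 1;
    return ST->getMVEVectorCostFactor();
  }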



# e771614b 11-Feb-2021 David Green <david.green@arm.com>

[ARM] Change getScalarizationOverhead overload used in gather costs. NFC

This changes which of the getScalarizationOverhead overloads is used in
the gather/scatter cost, using the base variant directly rather than the
overload that applies heuristics to the number of args when no args are
provided. It should still produce the same costs for scalarized
gathers/scatters.



# 92028062 09-Feb-2021 Jinsong Ji <jji@us.ibm.com>

Revert "[CostModel] Remove VF from IntrinsicCostAttributes"

This reverts commit 502a67dd7f23901834e05071ab253889f671b5d9.

This expose a failure in test-suite build on PowerPC,
revert to unblock bui

Revert "[CostModel] Remove VF from IntrinsicCostAttributes"

This reverts commit 502a67dd7f23901834e05071ab253889f671b5d9.

This exposed a failure in the test-suite build on PowerPC;
reverting to unblock the buildbot first.
Dave will re-commit in https://reviews.llvm.org/D96287.

Thanks Dave.



# 502a67dd 05-Feb-2021 David Green <david.green@arm.com>

[CostModel] Remove VF from IntrinsicCostAttributes

getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing, many backends end up getting it wrong.

Instead of trying to work with those two values separately, this removes
the VF parameter, widening the RetTy/ArgTys by VF when called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.

For most backends this looks like an improvement (or they were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf36af5c88c working. ARM removed the fix in
dfac521da1b90db683, WebAssembly happens to get a fixup for an SLP cost
issue, and both X86 and AArch64 now appear to be using better costs from
the vectorizer.

Differential Revision: https://reviews.llvm.org/D95291
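
The new calling convention can be pictured as follows (a sketch of a
caller such as the vectorizer; VF is an ElementCount and the names are
illustrative):

  // Widen the scalar types by VF up front; IntrinsicCostAttributes no
  // longer carries a separate VF.
  SmallVector<Type *, 4> WideArgTys;
  for (Type *ArgTy : ScalarArgTys)
    WideArgTys.push_back(VectorType::get(ArgTy, VF));
  IntrinsicCostAttributes Attrs(ID, VectorType::get(ScalarRetTy, VF),
                                WideArgTys);
  auto Cost = TTI.getIntrinsicInstrCost(Attrs, CostKind);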



# 40f46cb0 28-Jan-2021 David Green <david.green@arm.com>

[ARM] Add alignment checks for MVE VLDn

The MVE VLD2/4 and VST2/4 instructions require the pointer to be aligned
to at least the size of the element type. This adds a check for that
into the ARM lowerInterleavedStore and lowerInterleavedLoad functions,
not creating the intrinsics if they are invalid for the alignment of
the load/store.

Unfortunately this is one of those bug fixes that does affect some
useful codegen, as we were sometimes able to do some nice lowering of
q15 types. But these can cause problems with low-aligned pointers.

Differential Revision: https://reviews.llvm.org/D95319
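
The added guard amounts to something like this inside
lowerInterleavedLoad (a sketch; variable names assumed):

  // vld2/vld4 need the pointer aligned to at least the element size;
  // otherwise leave the interleaved access to the generic expansion.
  Type *EltTy = VecTy->getElementType();
  uint64_t EltSize = DL.getTypeStoreSize(EltTy);
  if (LI->getAlign().value() < EltSize)
    return false;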



# 39db5753 21-Jan-2021 David Green <david.green@arm.com>

[LV][ARM] Inloop reduction cost modelling

This adds cost modelling for the inloop vectorization added in
745bf6cf4471. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to work OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:

%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B

There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.

Also worth mentioning, under NEON there is the concept of an sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in-loop and outer-loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.

This patch attempts to improve the cost modelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.

Put together, this can greatly improve performance for reduction loops
under MVE.

Differential Revision: https://reviews.llvm.org/D93476
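
The new query described above takes roughly this form (a sketch; the
exact signature is in D93476):

  // Cost of vecreduce.add(ext(A)) or vecreduce.add(mul(ext(A), ext(B))),
  // keyed on both the input vector type and the wider result type.
  InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
                                              Type *ResTy, VectorType *ValTy,
                                              TTI::TargetCostKind CostKind);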



# dfac521d 21-Jan-2021 David Green <david.green@arm.com>

[ARM] Fix vector saddsat costs.

It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and a vector VF. This updates the cost model to
handle that, still producing the correct vector costs.

A vectorizer test is added to show it vectorizing at the correct factor
again.



# f373b309 19-Jan-2021 David Green <david.green@arm.com>

[ARM] Add MVE add.sat costs

This adds some basic MVE sadd_sat/ssub_sat/uadd_sat/usub_sat costs,
based on when the instruction is legal. With smaller-than-legal types
that are promoted, we generate shr(qadd(shl, shl)), so a cost of 4 is
appropriate.

Differential Revision: https://reviews.llvm.org/D94958
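
A cost sketch matching that description (the legality checks are
illustrative, not the patch's exact code):

  std::pair<InstructionCost, MVT> LT =
      getTypeLegalizationCost(DL, ICA.getReturnType());
  if (LT.second == MVT::v4i32 || LT.second == MVT::v8i16 ||
      LT.second == MVT::v16i8)
    return LT.first * ST->getMVEVectorCostFactor(); // single vqadd/vqsub
  return 4 * LT.first; // shr(qadd(shl, shl)) for promoted types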



Revision tags: llvmorg-11.1.0-rc1
# dcefcd51 11-Jan-2021 David Green <david.green@arm.com>

[ARM] Update trunc costs

We did not have specific costs for larger-than-legal truncates that were
not otherwise cheap (where they were next to stores, for example). As
MVE does not have a dedicated instruction for them (and we do not use
loads/stores yet), they should be expensive as they get expanded to a
series of lane moves.

Differential Revision: https://reviews.llvm.org/D94260



# 0e219b64 03-Jan-2021 Kazu Hirata <kazu@google.com>

[Target] Construct SmallVector with iterator ranges (NFC)


Revision tags: llvmorg-11.0.1, llvmorg-11.0.1-rc2
# a4823377 12-Dec-2020 David Green <david.green@arm.com>

[ARM] Add basic masked load/store costs

This adds some basic MVE masked load/store costs, notably changing the
cost of legal loads/stores to the MVECostFactor and the cost of
scalarized instructions to 8*NumElts.

Differential Revision: https://reviews.llvm.org/D86538
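
The described costs boil down to the following (a sketch; the predicate
name is hypothetical):

  // A legal MVE masked load/store is one predicated memory op, scaled
  // by the MVE cost factor; scalarized forms are charged per lane.
  if (isLegalMaskedLoadStore(Src, Alignment))
    return ST->getMVEVectorCostFactor();
  return 8 * cast<FixedVectorType>(Src)->getNumElements();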



Revision tags: llvmorg-11.0.1-rc1
# f08c37da 20-Nov-2020 David Green <david.green@arm.com>

[ARM] Disable WLSTP loops

This checks to see if the loop will likely become a tail predicated loop
and disables wls loop generation if so, as the likelihood for reverting
is currently too high. These should be fairly rare situations anyway due
to the way iterations and element counts are used during lowering. Just
not trying can, however, alter how SCEVs are materialized, leading to
different codegen.

It also adds an option to disable all while low overhead loops, for
debugging.

Differential Revision: https://reviews.llvm.org/D91663



# 006b3bde 19-Nov-2020 David Green <david.green@arm.com>

[ARM] Deliberately prevent inline asm in low overhead loops. NFC

This was already something that was handled by one of the "else"
branches in maybeLoweredToCall, so this patch is an NFC but makes it
explicit and adds a test. We may in the future want to support this
under certain situations, but for the moment just don't try to create
low overhead loops with inline asm in them.

Differential Revision: https://reviews.llvm.org/D91257
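
Inside maybeLoweredToCall the now-explicit case is essentially (sketch):

  // Inline asm may clobber LR, which a low overhead loop relies on,
  // so conservatively treat it like a call.
  if (const auto *CI = dyn_cast<CallInst>(&I))
    if (CI->isInlineAsm())
      return true;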



# c7e27538 10-Nov-2020 David Green <david.green@arm.com>

[ARM] Don't aggressively unroll vector remainder loops

We already do not unroll loops with vector instructions under MVE, but
that does not include the remainder loops that the vectorizer produces.
These remainder loops will be rarely executed and are not worth
unrolling, as the trip count is likely to be low if they get executed at
all. Luckily they get llvm.loop.isvectorized to make recognizing them
simpler.

We have wanted to do this for a while but hit issues with low overhead
loops being reverted due to difficult register allocation. With recent
changes that seems to be less of an issue now.

Differential Revision: https://reviews.llvm.org/D90055
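
Since the vectorizer tags both the vector loop and its remainder loop,
the check is essentially (sketch):

  // Leave loops the vectorizer produced (including scalar remainder
  // loops) alone; they carry the llvm.loop.isvectorized attribute.
  if (getBooleanLoopAttribute(L, "llvm.loop.isvectorized"))
    return;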



# b2ac9681 10-Nov-2020 David Green <david.green@arm.com>

[ARM] Alter t2DoLoopStart to define lr

This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR

This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.

This is a fairly simple change in itself, but leads to a number of other
required alterations.

- The hardware loop pass, if UsePhi is set, now generates loops of the
form:
%start = llvm.start.loop.iterations(%N)
loop:
%p = phi [%start], [%dec]
%dec = llvm.loop.decrement.reg(%p, 1)
%c = icmp ne %dec, 0
br %c, loop, exit
- For this a new llvm.start.loop.iterations intrinsic was added, identical
to llvm.set.loop.iterations but produces a value as seen above, gluing
the loop together more through def-use chains.
- This new intrinsic conceptually produces the same output as input,
which is taught to SCEV so that the checks in MVETailPredication are not
affected.
- Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
been left mostly as before. We should now more reliably be able to tell
that the t2DoLoopStart is correct without having to prove it, but
t2WhileLoopStart and tail-predicated loops will remain the same.
- And all the tests have been updated. There are a lot of them!

This patch on its own might cause more trouble than it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.

Differential Revision: https://reviews.llvm.org/D89881



# 264a6df3 05-Nov-2020 Sanjay Patel <spatel@rotateright.com>

[ARM] remove cost-kind predicate for cmp/sel costs

This is the cmp/sel sibling to D90692.
Again, the reasoning is: the throughput cost is the number of
instructions/uops, so size/blended costs are identical except in special
cases (for example, fdiv or other known-expensive machine instructions
or things like MVE that may require cracking into >1 uop).

We need to check for a valid (non-null) condition type parameter because
SimplifyCFG may pass nullptr for that (and so we will crash multiple
regression tests without that check). I'm not sure if passing nullptr makes
sense, but other code in the cost model does appear to check if that param
is set or not.

Differential Revision: https://reviews.llvm.org/D90781
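
The guard it refers to is just a null check before the condition type is
inspected (sketch):

  // SimplifyCFG may cost a select without supplying a condition type,
  // so check CondTy before using it on the MVE-specific path.
  if (CondTy && ValTy->isVectorTy() && ST->hasMVEIntegerOps()) {
    // MVE-specific cmp/sel costing that inspects CondTy goes here.
  }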



# eb611930 04-Nov-2020 David Green <david.green@arm.com>

[ARM] Remove unused variable. NFC


# c40126e7 03-Nov-2020 Sanjay Patel <spatel@rotateright.com>

[ARM] remove cost-kind predicate for most math op costs

This is based on the same idea that I am using for the basic model implementation
and what I have partly already done for x86: the throughput cost is the
number of instructions/uops, so size/blended costs are identical except
in special cases (for example, fdiv or other known-expensive machine
instructions or things like MVE that may require cracking into >1 uop).

Differential Revision: https://reviews.llvm.org/D90692



# bd323864 03-Nov-2020 David Green <david.green@arm.com>

[ARM] Remove unused variable. NFC


# e4744994 03-Nov-2020 David Green <david.green@arm.com>

[ARM] Treat memcpy/memset/memmove as call instructions for low overhead loops

If an instruction will be lowered to a call, there is no advantage to
using a low overhead loop, as the LR register will need to be spilled
and reloaded around the call and the low overhead loop will end up being
reverted. This teaches our hardware loop lowering that these memory
intrinsics will be calls under certain situations.

Differential Revision: https://reviews.llvm.org/D90439
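
In maybeLoweredToCall terms the idea is roughly (a sketch; the inlining
predicate is hypothetical):

  // A memcpy/memset/memmove that cannot be lowered inline becomes a
  // libcall, which clobbers LR, so report it as a call.
  if (auto *MI = dyn_cast<MemIntrinsic>(&I))
    if (!canLowerMemIntrinsicInline(MI))
      return true;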


