#
0a98efb0 |
| 15-Feb-2021 |
David Green <david.green@arm.com> |
[ARM] Add some basic Min/Max costs
This adds basic MVE costs for SMIN/SMAX/UMIN/UMAX, as well as MINNUM and MAXNUM representing fmin and fmax. It tightens up the costs, not using an ICmp+Select cost.
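For illustration (a sketch, not taken from the patch; the mnemonics are an assumption about the expected lowering), these costs apply to intrinsic calls of roughly this shape:
  %i = call <4 x i32> @llvm.smin.v4i32(<4 x i32> %a, <4 x i32> %b)         ; VMIN.S32
  %f = call <4 x float> @llvm.maxnum.v4f32(<4 x float> %x, <4 x float> %y) ; VMAXNM.F32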
Differential Revision: https://reviews.llvm.org/D96603
|
#
357237e9 |
| 15-Feb-2021 |
Sjoerd Meijer <sjoerd.meijer@arm.com> |
Recommit "[TTI] Unify FavorPostInc and FavorBackedgeIndex into getPreferredAddressingMode"
This reverts commit effc3b079927a6dd3084b4ff712ec07f926366f0, with the build problem fixed.
|
#
effc3b07 |
| 15-Feb-2021 |
Sjoerd Meijer <sjoerd.meijer@arm.com> |
Revert "[TTI] Unify FavorPostInc and FavorBackedgeIndex into getPreferredAddressingMode"
This reverts commit cd6de0e8de4a5fd558580be4b1a07116914fc8ed.
|
#
cd6de0e8 |
| 12-Feb-2021 |
Sjoerd Meijer <sjoerd.meijer@arm.com> |
[TTI] Unify FavorPostInc and FavorBackedgeIndex into getPreferredAddressingMode
This refactors shouldFavorPostInc() and shouldFavorBackedgeIndex() into getPreferredAddressingMode() so that we have one interface to steer LSR in generating the preferred addressing mode.
Differential Revision: https://reviews.llvm.org/D96600
|
#
79b1b4a5 |
| 12-Feb-2021 |
Sanjay Patel <spatel@rotateright.com> |
[Vectorizers][TTI] remove option to bypass creation of vector reduction intrinsics
The vector reduction intrinsics started life as experimental ops, so backend support was lacking. As part of promoting them to 1st-class intrinsics, however, codegen support was added/improved: D58015, D90247
So I think it is safe to now remove this complication from IR.
Note that we still have an IR-level codegen expansion pass for these as discussed in D95690. Removing that is another step in simplifying the logic. Also note that x86 was already unconditionally forming reductions in IR, so there should be no difference for x86.
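For illustration (a hand-written sketch, not lifted from the tests), the bypassed form kept reductions as a log2 shuffle ladder, which is now always expressed with the intrinsic:
  ; shuffle-based expansion of a v4i32 add reduction
  %h  = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %s0 = add <4 x i32> %v, %h
  %q  = shufflevector <4 x i32> %s0, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %s1 = add <4 x i32> %s0, %q
  %r  = extractelement <4 x i32> %s1, i32 0

  ; intrinsic form, now produced unconditionally
  %r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %v)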
I spot checked a couple of the tests here by running them through opt+llc and did not see any asm diffs.
If we do find functional differences for other targets, it should be possible to (at least temporarily) restore the shuffle IR with the ExpandReductions IR pass.
Differential Revision: https://reviews.llvm.org/D96552
|
#
b1ef919a |
| 11-Feb-2021 |
David Green <david.green@arm.com> |
[ARM] Add CostKind to getMVEVectorCostFactor.
This adds the CostKind to getMVEVectorCostFactor, so that it can automatically account for CodeSize costs, where it returns a cost of 1, not the MVEFactor used for Throughput/Latency. This helps simplify the caller code and allows us to get the codesize cost more correct in more cases.
|
#
e771614b |
| 11-Feb-2021 |
David Green <david.green@arm.com> |
[ARM] Change getScalarizationOverhead overload used in gather costs. NFC
This changes which of the getScalarizationOverhead overloads is used in the gather/scatter cost to use the base variant directly, rather than relying on the version that applies heuristics to the number of args when no args are provided. It should still produce the same costs for scalarized gathers/scatters.
|
#
92028062 |
| 09-Feb-2021 |
Jinsong Ji <jji@us.ibm.com> |
Revert "[CostModel] Remove VF from IntrinsicCostAttributes"
This reverts commit 502a67dd7f23901834e05071ab253889f671b5d9.
This exposes a failure in the test-suite build on PowerPC; revert to unblock the buildbot first, Dave will re-commit in https://reviews.llvm.org/D96287.
Thanks Dave.
|
#
502a67dd |
| 05-Feb-2021 |
David Green <david.green@arm.com> |
[CostModel] Remove VF from IntrinsicCostAttributes
getIntrinsicInstrCost takes an IntrinsicCostAttributes holding various parameters of the intrinsic being costed. It can either be called with a scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction (RetTy==Vector, VF==1), or from the vectorizer with a scalar type and vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered an error. Both of the vector modes are expected to be treated the same, but because this is confusing many backends end up getting it wrong.
Instead of trying to work with those two values separately, this removes the VF parameter, widening the RetTy/ArgTys by VF when called from the vectorizer. This keeps things simpler, but does require some other modifications to keep things consistent.
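A minimal sketch of the difference (an assumed example, not from the patch): where a backend previously saw a scalar return type plus VF, it now sees the widened call the vectorizer will actually create:
  ; before: queried as fabs with RetTy==float and VF==4
  ; after:  queried directly as the widened vector intrinsic
  %v = call <4 x float> @llvm.fabs.v4f32(<4 x float> %x)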
For most backends this looks like an improvement (or they were not using getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code from c230965ccf36af5c88c working. ARM removed the fix in dfac521da1b90db683, WebAssembly happens to get a fixup for an SLP cost issue, and both X86 and AArch64 seem to now be using better costs from the vectorizer.
Differential Revision: https://reviews.llvm.org/D95291
|
#
40f46cb0 |
| 28-Jan-2021 |
David Green <david.green@arm.com> |
[ARM] Add alignment checks for MVE VLDn
The MVE VLD2/4 and VST2/4 instructions require the pointer to be aligned to at least the size of the element type. This adds a check for that into the ARM lowerInterleavedStore and lowerInterleavedLoad functions, not creating the intrinsics if they are invalid for the alignment of the load/store.
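As a rough sketch (an assumed example, not from the patch), lowerInterleavedLoad matches a wide load feeding de-interleaving shuffles; with the new check, an under-aligned case like this keeps the plain load+shuffles rather than becoming a VLD2:
  %wide = load <8 x i16>, <8 x i16>* %p, align 1   ; align < element size: no VLD2 formed
  %even = shufflevector <8 x i16> %wide, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <8 x i16> %wide, <8 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>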
Unfortunately this is one of those bug fixes that does affect some useful codegen, as we were able to sometimes do some nice lowering of q15 types. But they can cause problems with low-aligned pointers.
Differential Revision: https://reviews.llvm.org/D95319
|
#
39db5753 |
| 21-Jan-2021 |
David Green <david.green@arm.com> |
[LV][ARM] Inloop reduction cost modelling
This adds cost modelling for the inloop vectorization added in 745bf6cf4471. Up until now they have been modelled as the original underlying instruction, usually an add. This happens to work OK for MVE with instructions that reduce into the same type as they are working on. But MVE's instructions can perform the equivalent of an extended MLA as a single instruction:
  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
->
  R = VMLADAV A, B
There are other instructions for performing add reductions of v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64 (VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV). The i64 cases are particularly interesting as there are no native i64 add/mul instructions, leading to the i64 add and mul naturally getting very high costs.
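For example (an illustrative pattern; the mnemonic is an assumption about the lowering), the v4i32->i64 case looks like:
  %e = sext <4 x i32> %x to <4 x i64>
  %r = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %e)   ; -> VADDLV.S32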
Also worth mentioning, under NEON there is the concept of a sdot/udot instruction which performs a partial reduction from a v16i8 to a v4i32. They extend and mul/sum the first four elements from the inputs into the first element of the output, repeating for each of the four output lanes. They could possibly be represented in the same way as above in LLVM, so long as a vecreduce.add could perform a partial reduction. The vectorizer would then produce a combination of in-loop and out-of-loop reductions to efficiently use the sdot and udot instructions. Although this patch does not do that yet, it does suggest that separating the input reduction type from the produced result type is a useful concept to model. It also shows that an MLA reduction as a single instruction is fairly common.
This patch attempts to improve the cost modelling of in-loop reductions by:
- Adding some pattern matching in the loop vectorizer cost model to match extended reduction patterns that are optionally extended and/or MLA patterns. This marks the cost of the reduction instruction correctly and the sext/zext/mul leading up to it as free, which is otherwise difficult to tell and may get a very high cost. (In the long run this can hopefully be replaced by vplan producing a single node and costing it correctly, but that is not yet something that vplan can do.)
- getExtendedAddReductionCost is added to query the cost of these extended reduction patterns.
- Expanding the ARM costs to account for these expanded sizes, which is a fairly simple change in itself.
- Some minor alterations to allow inloop reductions larger than the highest vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer for it to efficiently detect which instructions are reductions in the cost model.
- The tests have some updates to show what I believe is optimal vectorization and where we are now.
Put together, this can greatly improve performance for reduction loops under MVE.
Differential Revision: https://reviews.llvm.org/D93476
|
#
dfac521d |
| 21-Jan-2021 |
David Green <david.green@arm.com> |
[ARM] Fix vector saddsat costs.
It turns out the vectorizer calls the getIntrinsicInstrCost functions with a scalar return type and vector VF. This updates the cost model to handle that, still producing the correct vector costs.
A vectorizer test is added to show it vectorizing at the correct factor again.
|
#
f373b309 |
| 19-Jan-2021 |
David Green <david.green@arm.com> |
[ARM] Add MVE add.sat costs
This adds some basic MVE sadd_sat/ssub_sat/uadd_sat/usub_sat costs, based on when the instruction is legal. With smaller than legal types that are promoted we generate shr(qadd(shl, shl)), so a cost of 4 is appropriate.
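A hedged sketch of the two cases (mnemonic and expansion shape assumed, not taken from the patch):
  ; legal type: a single VQADD.S8, costed at the MVE factor
  %l = call <16 x i8> @llvm.sadd.sat.v16i8(<16 x i8> %a, <16 x i8> %b)
  ; smaller-than-legal, e.g. <4 x i8>: promoted to <4 x i32> and expanded
  ; roughly as shl + shl + qadd + ashr, hence the cost of 4
  %s = call <4 x i8> @llvm.sadd.sat.v4i8(<4 x i8> %c, <4 x i8> %d)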
Differential Revision: https://reviews.llvm.org/D94958
|
Revision tags: llvmorg-11.1.0-rc1 |
|
#
dcefcd51 |
| 11-Jan-2021 |
David Green <david.green@arm.com> |
[ARM] Update trunc costs
We did not have specific costs for larger than legal truncates that were not otherwise cheap (where they were next to stores, for example). As MVE does not have a dedicated instruction for them (and we do not use loads/stores yet), they should be expensive as they get expanded to a series of lane moves.
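One illustrative shape (an assumed example, not from the patch):
  %t = trunc <8 x i32> %x to <8 x i8>
A larger-than-legal truncate like this has no single MVE instruction and, when not folded into a store, ends up as a series of lane moves, so it now gets a correspondingly high cost.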
Differential Revision: https://reviews.llvm.org/D94260
|
#
0e219b64 |
| 03-Jan-2021 |
Kazu Hirata <kazu@google.com> |
[Target] Construct SmallVector with iterator ranges (NFC)
|
Revision tags: llvmorg-11.0.1, llvmorg-11.0.1-rc2 |
|
#
a4823377 |
| 12-Dec-2020 |
David Green <david.green@arm.com> |
[ARM] Add basic masked load/store costs
This adds some basic MVE masked load/store costs, notably changing the cost of legal loads/stores to the MVECostFactor and the cost of scalarized instructions to 8*NumElts.
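For reference, a sketch of the legal case (typed-pointer syntax of the time; the example itself is assumed, not from the patch):
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %p, i32 4, <4 x i1> %m, <4 x i32> undef)
A legal masked load like this is costed at the MVECostFactor (roughly one predicated load), while one that must be scalarized is costed at 8*NumElts.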
Differential Revision: https://reviews.llvm.org/D86538
|
Revision tags: llvmorg-11.0.1-rc1 |
|
#
f08c37da |
| 20-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Disable WLSTP loops
This checks to see if the loop will likely become a tail predicated loop and disables WLS loop generation if so, as the likelihood of reverting is currently too high. These should be fairly rare situations anyway due to the way iterations and element counts are used during lowering. Just not trying can alter how SCEVs are materialized, however, leading to different codegen.
It also adds an option to disable all while low overhead loops, for debugging.
Differential Revision: https://reviews.llvm.org/D91663
|
#
006b3bde |
| 19-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Deliberately prevent inline asm in low overhead loops. NFC
This was already something that was handled by one of the "else" branches in maybeLoweredToCall, so this patch is an NFC but makes it explicit and adds a test. We may in the future want to support this in certain situations, but for the moment we just don't try to create low overhead loops with inline asm in them.
Differential Revision: https://reviews.llvm.org/D91257
|
#
c7e27538 |
| 10-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Don't aggressively unroll vector remainder loops
We already do not unroll loops with vector instructions under MVE, but that does not include the remainder loops that the vectorizer produces. These remainder loops will be rarely executed and are not worth unrolling, as the trip count is likely to be low if they get executed at all. Luckily they are tagged with llvm.loop.isvectorized, which makes recognizing them simpler.
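For reference, the marker is ordinary loop metadata of this shape (a generic example, not from the patch):
  br i1 %c, label %loop, label %exit, !llvm.loop !0
  ...
  !0 = distinct !{!0, !1}
  !1 = !{!"llvm.loop.isvectorized", i32 1}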
We have wanted to do this for a while but hit issues with low overhead loops being reverted due to difficult register allocation. With recent changes that seems to be less of an issue now.
Differential Revision: https://reviews.llvm.org/D90055
|
#
b2ac9681 |
| 10-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Alter t2DoLoopStart to define lr
This changes the definition of t2DoLoopStart from
  t2DoLoopStart rGPR
to
  GPRlr = t2DoLoopStart rGPR
This will hopefully mean that low overhead loops are more tied together, and we can more reliably generate loops without reverting or being at the whims of the register allocator.
This is a fairly simple change in itself, but leads to a number of other required alterations.
- The hardware loop pass, if UsePhi is set, now generates loops of the form:
    %start = llvm.start.loop.iterations(%N)
  loop:
    %p = phi [%start], [%dec]
    %dec = llvm.loop.decrement.reg(%p, 1)
    %c = icmp ne %dec, 0
    br %c, loop, exit
- For this a new llvm.start.loop.iterations intrinsic was added, identical to llvm.set.loop.iterations but producing a value as seen above, gluing the loop together more through def-use chains.
- This new intrinsic conceptually produces the same output as its input, which is taught to SCEV so that the checks in MVETailPredication are not affected.
- Some minor changes are needed to the ARMLowOverheadLoop pass, but it has been left mostly as before. We should now more reliably be able to tell that the t2DoLoopStart is correct without having to prove it, but t2WhileLoopStart and tail-predicated loops will remain the same.
- And all the tests have been updated. There are a lot of them!
This patch on its own might cause more trouble than it helps, with more tail-predicated loops being reverted, but some additional patches can hopefully improve upon that to get to something that is better overall.
Differential Revision: https://reviews.llvm.org/D89881
|
#
264a6df3 |
| 05-Nov-2020 |
Sanjay Patel <spatel@rotateright.com> |
[ARM] remove cost-kind predicate for cmp/sel costs
This is the cmp/sel sibling to D90692. Again, the reasoning is: the throughput cost is the number of instructions/uops, so size/blended costs are identical except in special cases (for example, fdiv or other known-expensive machine instructions, or things like MVE that may require cracking into >1 uop).
We need to check for a valid (non-null) condition type parameter because SimplifyCFG may pass nullptr for that (and so we will crash multiple regression tests without that check). I'm not sure if passing nullptr makes sense, but other code in the cost model does appear to check if that param is set or not.
Differential Revision: https://reviews.llvm.org/D90781
|
#
eb611930 |
| 04-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Remove unused variable. NFC
|
#
c40126e7 |
| 03-Nov-2020 |
Sanjay Patel <spatel@rotateright.com> |
[ARM] remove cost-kind predicate for most math op costs
This is based on the same idea that I am using for the basic model implementation and what I have partly already done for x86: the throughput cost is the number of instructions/uops, so size/blended costs are identical except in special cases (for example, fdiv or other known-expensive machine instructions, or things like MVE that may require cracking into >1 uop).
Differential Revision: https://reviews.llvm.org/D90692
|
#
bd323864 |
| 03-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Remove unused variable. NFC
|
#
e4744994 |
| 03-Nov-2020 |
David Green <david.green@arm.com> |
[ARM] Treat memcpy/memset/memmove as call instructions for low overhead loops
If an instruction will be lowered to a call there is no advantage to using a low overhead loop, as the LR register will need to be spilled and reloaded around the call and the low overhead loop will end up being reverted. This teaches our hardware loop lowering that these memory intrinsics will be calls under certain situations.
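For example (an illustrative call in the typed-pointer syntax of the time), a memcpy with a non-constant length such as
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %d, i8* %s, i32 %n, i1 false)
will usually be lowered to a library call (e.g. __aeabi_memcpy on ARM), clobbering LR like any other call.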
Differential Revision: https://reviews.llvm.org/D90439
|