# e8ad87c7 | 08-Aug-2024 | Yeting Kuo <46629943+yetingk@users.noreply.github.com>

[Asan] Provide TTI hook to provide memory reference information of target intrinsics. (#97070)

Previously ASan treated target intrinsics as black boxes, so it could not
instrument accurate checks for them. This patch provides TTI hooks that let
targets describe the memory reference information of their intrinsics to ASan.

Note:
1. This patch renames InterestingMemoryOperand to MemoryRefInfo.
2. This patch does not support RVV indexed/segment load/store.

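As a rough illustration of the idea only (the hook name and struct fields below are invented for this sketch, not the upstream API), a target-side hook would report each pointer operand of an intrinsic so the sanitizer can instrument it like an ordinary load or store:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the renamed MemoryRefInfo record; the real
// structure lives in LLVM and carries more detail (mask, element type, ...).
struct MemoryRefInfoSketch {
  unsigned PtrOperandNo; // which intrinsic operand is the pointer
  bool IsWrite;          // store-like vs. load-like access
  uint64_t AccessSize;   // bytes accessed through that pointer
};

// Hypothetical target hook: describe the memory references of an intrinsic so
// the sanitizer no longer has to treat it as a black box.
void describeIntrinsicMemoryRefs(unsigned IntrinsicID,
                                 std::vector<MemoryRefInfoSketch> &Refs) {
  // Example: a masked-store-style intrinsic writing 4-byte elements through
  // operand 1 would be reported like this.
  Refs.push_back({/*PtrOperandNo=*/1, /*IsWrite=*/true, /*AccessSize=*/4});
}
```
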
Revision tags: llvmorg-19.1.0-rc2

# 9e462b7e | 29-Jul-2024 | Fabian Ritter <fabian.ritter@amd.com>

[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984)

...and also in TTI::getMemcpyLoopResidualLoweringType.

Revision tags: llvmorg-19.1.0-rc1, llvmorg-20-init

# 3d494bfc | 22-Jul-2024 | Tianqing Wang <tianqing.wang@intel.com>

[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. (#98495)

The `!unpredictable` metadata has been present for a long time, but its usage
in optimizations is still limited. This patch teaches `FoldTwoEntryPHINode()`
to be more aggressive with an unpredictable branch to reduce mispredictions.

A TTI interface `getBranchMispredictPenalty()` is added to distinguish between
different hardware to ensure we don't go too far for simpler cores. For
simplicity, only a naive x86 implementation is included for the time being.

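For context, a minimal source-level sketch of the kind of diamond FoldTwoEntryPHINode() targets (illustrative example only): when the condition is effectively random, flattening the diamond into a select avoids mispredictions.

```cpp
// Two-entry-PHI diamond: both arms just compute a value for 'r'.  With an
// unpredictable condition, speculating both sides and selecting is usually
// cheaper than branching on cores with a high mispredict penalty.
int clampLow(int x, int lo) {
  int r;
  if (x < lo)   // condition marked !unpredictable in IR, e.g. random data
    r = lo;
  else
    r = x;
  return r;     // IR: two-entry PHI, foldable to r = (x < lo) ? lo : x
}
```
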
# c5329c82 | 17-Jul-2024 | Sjoerd Meijer <smeijer@nvidia.com>

[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (#95819)

For the Neoverse V2 we would like to prefer fixed-width over scalable
vectorisation if the cost-model assigns an equal cost to both for certain
loops. This improves 7 kernels from TSVC-2 and several production kernels by
about 2x, and does not affect SPEC2017 INT and FP. This also adds a new TTI
hook that can steer the loop vectorizer towards preferring fixed-width
vectorization, which can be set per CPU. For now, this is only enabled for the
Neoverse V2.

There are 3 reasons why preferring NEON might be better when the cost-model is
a tie and the SVE vector size is the same as NEON (128-bit): architectural
reasons, micro-architectural reasons, and SVE codegen reasons. The latter will
be improved over time, so the more important reasons are the former two. I.e.,
the (micro-)architectural reason is the use of LDP/STP instructions, which are
not available in SVE2, and it avoids predication.

For what it is worth: this codegen strategy of generating more NEON is in line
with GCC's codegen strategy, which is actually even more aggressive in
generating NEON when no predication is required. We could be smarter about the
decision making, but this seems to be a good first step in the right
direction, and we can always revise this later (for example, make the target
hook more general).

# d28ed29d | 17-Jul-2024 | Sam Parker <sam.parker@arm.com>

[TTI][WebAssembly] Pairwise reduction expansion (#93948)

WebAssembly doesn't support horizontal operations, nor does it have a way of
expressing fast-math or reassoc flags, so runtimes are currently unable to use
pairwise operations when generating code from the existing shuffle patterns.

This patch allows the backend to select which (arbitrary) shuffle pattern is
used per reduction intrinsic. The default behaviour is the same as before,
which splits the vector into a top and a bottom half. The other pattern
introduced is for a pairwise shuffle. WebAssembly enables pairwise reductions
for int/fp add/sub.

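A scalar sketch of the two expansion shapes for a 4-lane floating-point add reduction (illustrative; the lane count and element type are arbitrary). Without fast-math or reassoc flags the two evaluation orders are not interchangeable, which is why the backend has to choose the pattern:

```cpp
#include <array>

// Split expansion: combine the top half with the bottom half, then reduce.
float reduceSplitHalves(const std::array<float, 4> &v) {
  float lo0 = v[0] + v[2]; // lane 0 + lane 2
  float lo1 = v[1] + v[3]; // lane 1 + lane 3
  return lo0 + lo1;
}

// Pairwise expansion: combine adjacent lanes first.
float reducePairwise(const std::array<float, 4> &v) {
  float p0 = v[0] + v[1]; // adjacent pair 0
  float p1 = v[2] + v[3]; // adjacent pair 1
  return p0 + p1;
}
```
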
# 9df71d76 | 28-Jun-2024 | Nikita Popov <npopov@redhat.com>

[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)

Similar to https://github.com/llvm/llvm-project/pull/96902, this adds
`getDataLayout()` helpers to Function and GlobalValue, replacing the current
`getParent()->getDataLayout()` pattern.

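A small usage sketch of the pattern change, assuming an LLVM checkout that includes this change:

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Returns the data layout a function is compiled under.
const DataLayout &dataLayoutOf(const Function &F) {
  // Old spelling: return F.getParent()->getDataLayout();
  return F.getDataLayout(); // helper added by this change
}
```
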
# 15fc801c | 27-Jun-2024 | Shengchen Kan <shengchen.kan@intel.com>

[X86][CodeGen] Support hoisting load/store with conditional faulting (#96720)

1. Add a TTI interface for conditional load/store.
2. Mark 1 x i16/i32/i64 masked load/store legal so that it's not legalized in
   the scalarize-masked-mem-intrin pass.
3. Visit 1 x i16/i32/i64 masked load/store to build a target-specific
   CLOAD/CSTORE node to avoid an error in
   `DAGTypeLegalizer::ScalarizeVectorResult`.
4. Combine DAG to simplify the nodes for CLOAD/CSTORE.
5. Lower CLOAD/CSTORE to CFCMOV by pattern match.

This is the CodeGen part of #95515.

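A source-level sketch of the kind of guarded access this enables hoisting for (illustrative only, not taken from the patch); without a conditionally faulting load, speculating `*p` out of the branch would be unsafe:

```cpp
// The guarded dereference below maps to a 1 x i32 masked load, which the new
// lowering turns into a target CLOAD node and ultimately a CFCMOV, so the
// access can be hoisted/if-converted without risking a spurious fault.
int loadIfValid(const int *p, bool valid, int fallback) {
  int v = fallback;
  if (valid)
    v = *p;
  return v;
}
```
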
# 1462605a | 25-Jun-2024 | Kazu Hirata <kazu@google.com>

[Analysis] Use range-based for loops (NFC) (#96587)

Revision tags: llvmorg-18.1.8, llvmorg-18.1.7

# 5a201415 | 05-Jun-2024 | Alex Bradbury <asb@igalia.com>

[LSR] Provide TTI hook to enable dropping solutions deemed to be unprofitable (#89924)

<https://reviews.llvm.org/D126043> introduced a flag to drop solutions if
deemed unprofitable. As noted there, introducing a TTI hook enables backends
to individually opt into this behaviour.

This will be used by #89927.

# e8dd4df7 | 22-May-2024 | Tyler Lanphear <tylanphear@gmail.com>

[NFC][TTI] Mark `getReplicationShuffleCost()` as `const` (#92194)

Revision tags: llvmorg-18.1.6

# fbb37e96 | 13-May-2024 | Graham Hunter <graham.hunter@arm.com>

[AArch64] Add an all-in-one histogram intrinsic

Based on discussion from
https://discourse.llvm.org/t/rfc-vectorization-support-for-histogram-count-operations/74788

Current interface is:

llvm.experimental.histogram(<vecty> ptrs, <intty> inc_amount, <vecty> mask)

The integer type used by 'inc_amount' needs to match the type of the buckets
in memory.

The intrinsic covers the following operations:
* gather load
* histogram on the elements of 'ptrs'
* multiply the histogram results by 'inc_amount'
* add the result of the multiply to the values loaded by the gather
* scatter store the results of the add

Supports lowering to histcnt instructions for AArch64 targets, and
scalarization for all others at present.

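A scalar-equivalent sketch of the semantics listed above (illustrative types only; the real intrinsic operates on whole vectors and must handle duplicate pointers within one vector correctly):

```cpp
#include <cstdint>
#include <vector>

// Per active lane: load the bucket, add inc_amount, store it back.  A bucket
// referenced by k active lanes therefore grows by k * inc_amount, matching
// the gather / histogram / multiply / add / scatter description.
void histogramAdd(const std::vector<int64_t *> &ptrs, int64_t inc_amount,
                  const std::vector<bool> &mask) {
  for (std::size_t lane = 0; lane < ptrs.size(); ++lane)
    if (mask[lane])
      *ptrs[lane] += inc_amount;
}
```
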
# 2e8d8155 | 10-May-2024 | Graham Hunter <graham.hunter@arm.com>

[TTI] Support scalable offsets in getScalingFactorCost (#88113)

Part of the work to support vscale-relative immediates in LSR.

Revision tags: llvmorg-18.1.5, llvmorg-18.1.4

# 4ac2721e | 09-Apr-2024 | David Green <david.green@arm.com>

[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)

This tries to add some costs for the shuffle in an ST3/ST4 instruction, which
is represented in LLVM IR as store(interleaving shuffle). In order to detect
the store, it needs to add a CxtI context instruction to check the users of
the shuffle. LD3 and LD4 are added; LD2 should be a zip1 shuffle, which will
be added in another patch.

It should help fix some of the regressions from #87510.

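An illustrative source loop (not from the patch) that produces the store(interleaving shuffle) pattern in IR and is typically lowered to ST3 on AArch64:

```cpp
// Three input streams written as interleaved triples: the vectorizer emits
// interleaving shufflevectors feeding a wide store, which AArch64 lowers to
// ST3.  The cost model needs the store as context to recognise the pattern.
void interleave3(float *dst, const float *a, const float *b, const float *c,
                 int n) {
  for (int i = 0; i < n; ++i) {
    dst[3 * i + 0] = a[i];
    dst[3 * i + 1] = b[i];
    dst[3 * i + 2] = c[i];
  }
}
```
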
Revision tags: llvmorg-18.1.3

# 36a3f8f6 | 20-Mar-2024 | Graham Hunter <graham.hunter@arm.com>

[TTI][TLI][AArch64] Support scalable immediates with isLegalAddImmediate (#84173)

Adds a second parameter (defaulting to 0) to isLegalAddImmediate to represent
a scalable immediate.

Extends the AArch64 implementation to match immediates based on what addvl and
inc[h|w|d] support.

# cd768ec9 | 20-Mar-2024 | Graham Hunter <graham.hunter@arm.com>

[AArch64] Support scalable offsets with isLegalAddressingMode (#83255)

Allows us to indicate that an addressing mode featuring a vscale-relative
immediate offset is supported.

Revision tags: llvmorg-18.1.2

# f795d1a8 | 14-Mar-2024 | Paschalis Mpeis <paschalis.mpeis@arm.com>

[AArch64][LV][SLP] Vectorizers use call cost for vectorized frem (#82488)

getArithmeticInstrCost is used by both LoopVectorizer and SLPVectorizer to
compute the cost of frem, which becomes a call cost on AArch64 when TLI has a
vector library function.

Add tests that do SLP vectorization for code that contains 2x double and
4x float frem instructions.

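An example of the kind of loop in question (illustrative): each `frem` in the vectorized body becomes a call to a vector math-library routine when TLI provides one, so the vectorizers should price it as a call rather than as plain arithmetic.

```cpp
#include <cmath>

// 4 x float frem candidate: with a vector library available, the vectorized
// loop calls the library's vectorized fmod instead of scalar frem.
void fremLoop(float *a, const float *b, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = std::fmod(a[i], b[i]); // 'frem' at the IR level
}
```
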
Revision tags: llvmorg-18.1.1

# 889d99a5 | 06-Mar-2024 | Kolya Panchenko <87679760+nikolaypanchenko@users.noreply.github.com>

[TTI] Add alignment argument to TTI for compress/expand support (#83516)

Since `llvm.compressstore` and `llvm.expandload` do require memory access,
it's essential for some targets to check whether the alignment is good enough
to be able to lower them to target-specific instructions.

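A scalar-equivalent sketch of `llvm.masked.compressstore` semantics (illustrative), which shows why alignment matters: the active lanes are packed and written contiguously starting at the destination pointer.

```cpp
#include <cstdint>
#include <vector>

// Only the selected lanes are stored, back to back, starting at dst.  A
// target can lower this directly to a compressing store instruction only if
// the pointer's alignment meets its requirements, which is what the new
// alignment argument on the TTI legality query is for.
void compressStore(const std::vector<int32_t> &vals,
                   const std::vector<bool> &mask, int32_t *dst) {
  std::size_t out = 0;
  for (std::size_t i = 0; i < vals.size(); ++i)
    if (mask[i])
      dst[out++] = vals[i];
}
```
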
Revision tags: llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2

# 8ad14b6d | 01-Feb-2024 | Alexey Bataev <5361294+alexey-bataev@users.noreply.github.com>

[TTI]Add support for strided loads/stores.

Added basic legality check and cost estimation functions for strided loads and
stores. These interfaces will be built upon in
https://github.com/llvm/llvm-project/pull/80310.

Reviewers: preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/80329

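A scalar-equivalent sketch of a strided load (illustrative): successive lanes come from addresses a constant stride apart, which is the access shape the new legality and cost hooks describe.

```cpp
#include <cstdint>

// Lane i reads base[i * stride]; stride is measured in elements here.
// Hardware with native strided memory instructions (e.g. RISC-V vlse/vsse)
// can handle this directly, which is what the legality check and cost
// estimate are meant to expose.
void stridedLoad(int32_t *lanes, const int32_t *base, long stride,
                 int numLanes) {
  for (int lane = 0; lane < numLanes; ++lane)
    lanes[lane] = base[lane * stride];
}
```
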
Revision tags: llvmorg-18.1.0-rc1, llvmorg-19-init

# a2d68b4b | 22-Jan-2024 | David Green <david.green@arm.com>

[SelectOpt] Add handling for Select-like operations. (#77284)

Some operations behave like selects. For example `or(zext(c), y)` is the same
as `select(c, y|1, y)`, and instcombine can canonicalize the select to the or
form. These operations can still be worthwhile to convert to branches, as
opposed to keeping them as select or or instructions.

This patch attempts to add some basic handling for them, creating a SelectLike
abstraction in the select optimization pass. The backend can opt into handling
`or(zext(c),x)` as a select if it could be profitable, and the select
optimization pass attempts to handle it in much the same way as a
`select(c, x|1, x)`. The Or(x, 1) may need to be added as a new instruction,
generated as the or is converted to branches.

This helps fix a regression from selects being converted to ors recently.

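A self-contained worked check of the equivalence the pass relies on: for a boolean `c`, `or(zext(c), y)` and `select(c, y|1, y)` produce the same value, because the zero-extended bool only ever contributes bit 0.

```cpp
#include <cassert>
#include <cstdint>

uint32_t asOr(bool c, uint32_t y) { return static_cast<uint32_t>(c) | y; }
uint32_t asSelect(bool c, uint32_t y) { return c ? (y | 1u) : y; }

int main() {
  for (uint32_t y : {0u, 1u, 2u, 0x80000000u, 0xDEADBEEFu})
    for (bool c : {false, true})
      assert(asOr(c, y) == asSelect(c, y)); // identical for every input
  return 0;
}
```
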
# c7148467 | 09-Jan-2024 | David Sherwood <57997763+david-arm@users.noreply.github.com>

[AArch64] Add an AArch64 pass for loop idiom transformations (#72273)

We have added a new pass that looks for loops such as the following:

```
while (i != max_len)
    if (a[i] != b[i])
        break;

... use index i ...
```

Although similar to a memcmp, this is slightly different because instead of
returning the difference between the values of the first non-matching pair of
bytes, it returns the index of the first mismatch. As such, we are not able to
lower this to a memcmp call.

The new pass can now spot such idioms and transform them into a specialised
predicated loop that gives a significant performance improvement for AArch64.
It is intended as a stop-gap solution until this can be handled by the
vectoriser, which doesn't currently deal with early exits.

This specialised loop makes use of a generic intrinsic that counts the
trailing zero elements in a predicate vector. This was added in
https://reviews.llvm.org/D159283, and for SVE we end up with brkb & incp
instructions.

Although we have added this pass only for AArch64, it was written in a generic
way so that in theory it could be used by other targets. Currently the pass
requires scalable vector support and needs to know the minimum page size for
the target; however, it's possible to make it work for fixed-width vectors
too. Also, the llvm.experimental.cttz.elts intrinsic used by the pass has
generic lowering, but can be made efficient for targets with instructions
similar to SVE's brkb, cntp and incp.

The original version of this patch was posted on Phabricator:
https://reviews.llvm.org/D158291

Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David Sherwood
(@david-arm)

See the original discussion on Discourse:
https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383

# 50965010 | 27-Dec-2023 | Alexey Bataev <5361294+alexey-bataev@users.noreply.github.com>

[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461)

SLP/TTI do not know about the cost estimation for the addsub pattern supported
by X86. Previously, support for pattern detection was added (see
TTI::isLegalAltInstr), but the cost was still not estimated properly.

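An illustrative scalar form of the addsub pattern (even lanes subtract, odd lanes add), which X86 can execute as a single ADDSUBPS once SLP vectorizes it; this change lets the cost model price it accordingly.

```cpp
// Alternating sub/add across four lanes: the "alternate" opcode pattern SLP
// recognises (see TTI::isLegalAltInstr) and X86 maps to ADDSUBPS.
void addsub4(float *r, const float *a, const float *b) {
  r[0] = a[0] - b[0]; // even lanes: subtract
  r[1] = a[1] + b[1]; // odd lanes: add
  r[2] = a[2] - b[2];
  r[3] = a[3] + b[3];
}
```
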
# fb981e6b | 28-Dec-2023 | Douglas Yung <douglas.yung@sony.com>

Revert "[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461)"

This reverts commit bc8c4bbd7973ab9527a78a20000aecde9bed652d.

Change is failing to build on several bots:
- https://lab.llvm.org/buildbot/#/builders/127/builds/60184
- https://lab.llvm.org/buildbot/#/builders/123/builds/23709
- https://lab.llvm.org/buildbot/#/builders/216/builds/32302

# bc8c4bbd | 27-Dec-2023 | Alexey Bataev <5361294+alexey-bataev@users.noreply.github.com>

[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461)

SLP/TTI do not know about the cost estimation for the addsub pattern supported
by X86. Previously, support for pattern detection was added (see
TTI::isLegalAltInstr), but the cost was still not estimated properly.

# 930b5b52 | 13-Dec-2023 | Paul Walker <paul.walker@arm.com>

[ConstantHoisting] Add a TTI hook to prevent hoisting. (#69004)

Code generation can sometimes simplify expensive operations when an operand is
constant. An example of this is divides on AArch64, where they can be
rewritten using a cheaper sequence of multiplies and subtracts. Doing this is
often better than hoisting expensive constants, which are likely to be hoisted
by MachineLICM anyway.

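A worked example (not from the patch) of the kind of simplification that constant hoisting would otherwise block: an unsigned divide by the literal 3 can be lowered as a multiply-and-shift using the well-known magic constant 0xAAAAAAAB, but only while the divisor is still visible as a constant rather than hoisted into a register.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-and-shift replacement for x / 3 on 32-bit unsigned values.
uint32_t divBy3(uint32_t x) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(x) * 0xAAAAAAABull) >> 33);
}

int main() {
  for (uint32_t x : {0u, 1u, 2u, 3u, 4u, 1000u, 0x7FFFFFFFu, 0xFFFFFFFFu})
    assert(divBy3(x) == x / 3u); // matches a real divide on these samples
  return 0;
}
```
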
# e947f953 | 29-Nov-2023 | Philip Reames <preames@rivosinc.com>

[LSR][TTI][RISCV] Enable terminator folding for RISC-V

If looking for a miscompile revert candidate, look here!

The transform being enabled prefers comparing against a loop-invariant exit
value for a secondary IV over using an otherwise dead primary IV. This
increases register pressure (by requiring the exit value to be live through
the loop), but reduces the number of instructions within the loop by one.

On RISC-V, which has a large number of scalar registers, this is generally a
profitable transform. We lose the ability to use a beqz on what is typically a
count-down IV, and pay the cost of computing the exit value of the secondary
IV in the loop preheader, but save an add or sub in the loop body. For
anything except an extremely short-running loop, or one with extreme register
pressure, this is profitable. On spec2017, we see a 0.42% geomean improvement
in dynamic icount, with no individual workload regressing by more than 0.25%.

Code-size wise, we trade a (possibly compressible) beqz and a (possibly
compressible) addi for an uncompressible beq. We also add instructions in the
preheader. The net result is a slight regression overall, but neutral or
better inside the loop.

Previous versions of this transform had numerous corner-case correctness bugs.
All of the ones I can spot by inspection have been fixed, and I have run this
through all of spec2017, but there may be further issues lurking. Adding uses
to an IV is a fraught thing to do given poison semantics, so this transform is
somewhat inherently risky.

This patch is a reworked version of D134893 by @eop. That patch has been
abandoned since May, so I picked it up, reworked it a bit, and am landing it.

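An illustrative source-level before/after (variable names made up): the primary IV `i` below exists only to drive the exit test, and folding the terminator tests the secondary IV `j` against its precomputed exit value instead.

```cpp
// Before: count-down IV 'i' gives a cheap beqz-style exit test, but costs a
// decrement every iteration.
long sumBefore(const long *a, long n) {
  long sum = 0;
  for (long i = n, j = 0; i != 0; --i, ++j)
    sum += a[j];
  return sum;
}

// After: compare 'j' against the loop-invariant exit value 'n'.  One fewer
// instruction in the body, at the price of keeping 'n' live through the loop.
long sumAfter(const long *a, long n) {
  long sum = 0;
  for (long j = 0; j != n; ++j)
    sum += a[j];
  return sum;
}
```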