18f8106f | 29-Jan-2025 | Joel E. Denny <jdenny.ornl@gmail.com>
[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944)
This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics is to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.
By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
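For flavour, here is a sketch (mine, not from the patch or its tests) of the kind of construct those alloca remarks are meant to surface: a stack array inside an OpenMP offload region. Compiling something like this for a GPU target with `-Rpass=kernel-info`, as described above, would be expected to report the alloca and its source location.
```
/* Illustrative only; not taken from the patch. */
void saxpy(const float *x, float *y, float a, int n) {
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i) {
    float scratch[256];   /* static alloca inside the offloaded kernel */
    scratch[0] = a * x[i];
    y[i] += scratch[0];
  }
}
```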

Revision tags: llvmorg-21-init

831527a5 | 17-Jan-2025 | Alexandros Lamprineas <alexandros.lamprineas@arm.com>
[FMV][GlobalOpt] Statically resolve calls to versioned functions. (#87939)
To deduce whether the optimization is legal, we need to compare the target
features between the caller and callee versions. The criteria for bypassing
the resolver are the following:
* If the callee's feature set is a subset of the caller's feature set,
then the callee is a candidate for a direct call.
* Among such candidates, the one of highest priority is the best match,
and it shall be picked unless there is a callee version of higher
priority than the best match which cannot be picked from a
higher-priority caller (directly or through the resolver).
* For every callee version of higher priority than the best match, there
is a higher-priority caller version whose feature set availability
is implied by the callee's feature set.
Example:
Callers and Callees are ordered in decreasing priority.
The arrows indicate successful call redirections.

  Caller           Callee     Explanation
  =======================================================================
  mops+sve2 --+--> mops       all the callee versions are subsets of the
              |               caller but mops has the highest priority
              |
  mops      --+    sve2       between mops and default callees, mops wins
  sve              sve        between sve and default callees, sve wins
                              but sve2 does not have a high priority caller
  default   -----> default    sve (callee) implies sve (caller),
                              sve2 (callee) implies sve (caller),
                              mops (callee) implies mops (caller)
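A minimal C sketch of the scenario, assuming Clang's AArch64 Function Multi Versioning attributes (function names and feature strings are illustrative, not from the patch):
```
/* Each annotated definition is one version of the function; calls normally
 * dispatch through an ifunc resolver. When the calling version's features
 * already imply a callee version's features, this GlobalOpt change can
 * redirect the call directly to that version and bypass the resolver. */
__attribute__((target_version("sve2")))    int callee(void) { return 2; }
__attribute__((target_version("default"))) int callee(void) { return 0; }

__attribute__((target_version("sve2"))) int caller(void) {
  return callee();   /* sve2 caller implies sve2 callee: direct call is legal */
}
__attribute__((target_version("default"))) int caller(void) {
  return callee();   /* default caller: must keep going through the resolver */
}
```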

Revision tags: llvmorg-19.1.7

795e35a6 | 13-Jan-2025 | Sam Tebbs <samuel.tebbs@arm.com>
Reland "[LoopVectorizer] Add support for partial reductions" with non-phi operand fix. (#121744)
This relands the reverted #120721 with a fix for cases where neither reduction operand are the reduct
Reland "[LoopVectorizer] Add support for partial reductions" with non-phi operand fix. (#121744)
This relands the reverted #120721 with a fix for cases where neither reduction
operand is the reduction phi. Only 63114239cc8d26225a0ef9920baacfc7cc00fc58 is
new on top of the reverted PR.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>

4d8f9594 | 27-Dec-2024 | Zequan Wu <zequanwu@google.com>
Revert "Reland "[LoopVectorizer] Add support for partial reductions" (#120721)"
This reverts commit c858bf620c3ab2a4db53e84b9365b553c3ad1aa6 as it casuse optimization crash on -O2, see https://githu
Revert "Reland "[LoopVectorizer] Add support for partial reductions" (#120721)"
This reverts commit c858bf620c3ab2a4db53e84b9365b553c3ad1aa6 as it causes an
optimization crash at -O2, see
https://github.com/llvm/llvm-project/pull/120721#issuecomment-2563192057

c858bf62 | 24-Dec-2024 | Sam Tebbs <samuel.tebbs@arm.com>
Reland "[LoopVectorizer] Add support for partial reductions" (#120721)
This re-lands the reverted #92418
When the VF is small enough so that dividing the VF by the scaling factor results in 1, the
Reland "[LoopVectorizer] Add support for partial reductions" (#120721)
This re-lands the reverted #92418
When the VF is small enough so that dividing the VF by the scaling factor
results in 1, the reduction phi execution thinks the VF is scalar and sets
the reduction's output as a scalar value, tripping assertions expecting a
vector value. The latest commit in this PR fixes that by using `State.VF`
in the scalar check, rather than the divided VF.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>

5f096fd2 | 19-Dec-2024 | Florian Hahn <flo@fhahn.com>
Revert "[LoopVectorizer] Add support for partial reductions (#92418)"
This reverts commit 060d62b48aeb5080ffcae1dc56e41a06c6f56701.
It looks like this is triggering an assertion when build llvm-tes
Revert "[LoopVectorizer] Add support for partial reductions (#92418)"
This reverts commit 060d62b48aeb5080ffcae1dc56e41a06c6f56701.
It looks like this is triggering an assertion when building llvm-test-suite on ARM64 macOS.
Reproducer from MultiSource/Benchmarks/Ptrdist/bc/number.c
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "arm64-apple-macosx15.0.0"

define void @test(i64 %idx.neg, i8 %0) #0 {
entry:
  br label %while.body

while.body:                                       ; preds = %while.body, %entry
  %n1ptr.0.idx131 = phi i64 [ %n1ptr.0.add, %while.body ], [ %idx.neg, %entry ]
  %n2ptr.0.idx130 = phi i64 [ %n2ptr.0.add, %while.body ], [ 0, %entry ]
  %sum.1129 = phi i64 [ %add99, %while.body ], [ 0, %entry ]
  %n1ptr.0.add = add i64 %n1ptr.0.idx131, 1
  %conv = sext i8 %0 to i64
  %n2ptr.0.add = add i64 %n2ptr.0.idx130, 1
  %1 = load i8, ptr null, align 1
  %conv97 = sext i8 %1 to i64
  %mul = mul i64 %conv97, %conv
  %add99 = add i64 %mul, %sum.1129
  %cmp94 = icmp ugt i64 %n1ptr.0.idx131, 0
  %cmp95 = icmp ne i64 %n2ptr.0.idx130, -1
  %2 = and i1 %cmp94, %cmp95
  br i1 %2, label %while.body, label %while.end.loopexit

while.end.loopexit:                               ; preds = %while.body
  %add99.lcssa = phi i64 [ %add99, %while.body ]
  ret void
}

attributes #0 = { "target-cpu"="apple-m1" }

> opt -p loop-vectorize
Assertion failed: ((VF.isScalar() || V->getType()->isVectorTy()) && "scalar values must be stored as (0, 0)"), function set, file VPlan.h, line 284.

45c01e8a | 19-Dec-2024 | Finn Plummer <50529406+inbelic@users.noreply.github.com>
[NFC][TargetTransformInfo][VectorUtils] Consolidate `isVectorIntrinsic...` api (#117635)
- update `VectorUtils::isVectorIntrinsicWithScalarOpAtArg` to use TTI for
all uses, to allow specification of target-specific intrinsics
- add TTI to the `isVectorIntrinsicWithStructReturnOverloadAtField` api
- update TTI api to provide `isTargetIntrinsicWith...` functions and
consistently name them
- move `isTriviallyScalarizable` to VectorUtils
- update all uses of the api and provide the TTI parameter
Resolves #117030

060d62b4 | 19-Dec-2024 | Nicholas Guy <nicholas.guy@arm.com>
[LoopVectorizer] Add support for partial reductions (#92418)
Following on from https://github.com/llvm/llvm-project/pull/94499, this patch
adds support to the Loop Vectorizer to emit the partial reduction intrinsics
where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>

Revision tags: llvmorg-19.1.6, llvmorg-19.1.5

0ad6be19 | 29-Nov-2024 | Jonas Paulsson <paulson1@linux.ibm.com>
[SLPVectorizer, TargetTransformInfo, SystemZ] Improve SLP getGatherCost(). (#112491)
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.
getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().
SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
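For intuition (example mine, not from the patch): a "gather" here is SLP building a vector out of scalar loads from non-adjacent addresses. Since SystemZ can load scalar elements directly into vector lanes, those element loads are now costed as essentially free, making patterns like the following cheaper to vectorize:
```
/* The two stores form an SLP tree, but the operands are loads from
 * non-adjacent locations, so the operand vectors must be gathered from
 * scalar element loads. This change stops the cost model from charging
 * for those element loads on SystemZ. */
void f(const double *a, const double *b, double *restrict out) {
  out[0] = a[0] + b[7];
  out[1] = a[5] + b[2];
}
```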

8663b877 | 21-Nov-2024 | Finn Plummer <50529406+inbelic@users.noreply.github.com>
[NFC][VectorUtils][TargetTransformInfo] Add `isVectorIntrinsicWithOverloadTypeAtArg` api (#114849)
This change allows target intrinsics to specify and overwrite overloaded types.
- Updates `ReplaceWithVecLib` to not provide TTI as there most probably won't be a use-case
- Updates `SLPVectorizer` to use available TTI
- Updates `VPTransformState` to pass down TTI
- Updates `VPlanRecipe` to use passed-down TTI
This change will let us add scalarization for `asdouble`: #114847

9bccf61f | 20-Nov-2024 | Sjoerd Meijer <smeijer@nvidia.com>
[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385)
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
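To make the trade-off concrete (numbers are illustrative, not from the patch): with an interleave count of 4 at a vectorization factor of, say, 16, the main vector body only runs once there are 64 or more iterations, so short trip counts fall entirely into the epilogue, which is why the minimum epilogue vectorization factor was lowered:
```
/* If the main vector body covers 64 elements per iteration (VF x IC), a
 * call with n == 20 never enters it and runs only the epilogue. Allowing
 * the epilogue to be vectorized with VF 8 instead of requiring VF 16 keeps
 * such short loops from degrading to a scalar remainder loop. */
void scale(float *restrict dst, const float *restrict src, float k, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = k * src[i];
}
```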

Revision tags: llvmorg-19.1.4

9991ea28 | 13-Nov-2024 | Sushant Gokhale <sgokhale@nvidia.com>
[CostModel][AArch64] Make extractelement, with fmul user, free whenever possible (#111479)
In case of Neon, if there exists extractelement from lane != 0 such that
1. extractelement does not necessitate a move from vector_reg -> GPR
2. extractelement result feeds into fmul
3. Other operand of fmul is a scalar or extractelement from lane 0 or
lane equivalent to 0
then the extractelement can be merged with fmul in the backend and it
incurs no cost.
e.g.
```
define double @foo(<2 x double> %a) {
%1 = extractelement <2 x double> %a, i32 0
%2 = extractelement <2 x double> %a, i32 1
%res = fmul double %1, %2
ret double %res
}
```
`%2` and `%res` can be merged in the backend to generate:
`fmul d0, d0, v0.d[1]`
The change was tested with SPEC FP(C/C++) on Neoverse-v2.
**Compile time impact**: None
**Performance impact**: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.

236fda55 | 06-Nov-2024 | Kazu Hirata <kazu@google.com>
[Analysis] Remove unused includes (NFC) (#114936)
Identified with misc-include-cleaner.

Revision tags: llvmorg-19.1.3

e37d736d | 24-Oct-2024 | Nashe Mncube <nashe.mncube@arm.com>
Recommit: [llvm][ARM][GlobalOpt]Add widen global arrays pass (#113289)
This is a recommit of #107120. The original PR was approved but failed on a
buildbot. The newly added tests should only be run for compilers that
support the ARM target. This has been resolved by adding a config file
for these tests.
- The pass optimizes memcpys by padding out destinations and sources to a
full word so that the ARM backend generates full-word loads instead of
loading a single byte (ldrb) and/or half word (ldrh); see the sketch
below. It only pads the destination when it is a stack-allocated,
constant-size array, and the source when it is a constant string. The
heuristic used to decide whether to pad or not is very basic and could
be improved to allow more examples to be padded.
- The pass works at the midend level.
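A hedged sketch of the kind of source the pass targets (sizes and names are made up; the actual widening happens on the IR globals and allocas, not in the C source):
```
#include <string.h>

void use(const char *);

/* Copying a 7-byte constant string into a 7-byte stack array would otherwise
 * lower to ldrb/ldrh sequences on ARM. Padding both the global string and
 * the alloca out to a full-word multiple (8 bytes) lets the backend use
 * full-word loads and stores for the copy. */
void greet(void) {
  char buf[7];
  memcpy(buf, "hello!", 7);
  use(buf);
}
```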

370fd743 | 17-Oct-2024 | Nashe Mncube <nashe.mncube@arm.com>
Revert "[llvm][ARM]Add widen global arrays pass" (#112701)
Reverts llvm/llvm-project#107120
Unexpected build failures in post-commit pipelines. Needs investigation.

ab90d279 | 17-Oct-2024 | Nashe Mncube <nashe.mncube@arm.com>
[llvm][ARM]Add widen global arrays pass (#107120)
- The pass optimizes memcpys by padding out destinations and sources to a
full word so that the backend generates full-word loads instead of loading a
single byte (ldrb) and/or half word (ldrh). It only pads the destination when
it is a stack-allocated, constant-size array, and the source when it is a
constant array. The heuristic used to decide whether to pad or not is very
basic and could be improved to allow more examples to be padded.
- The pass works within GlobalOpt but is disabled by default on all targets
except ARM.

Revision tags: llvmorg-19.1.2

f9bc00e4 | 14-Oct-2024 | Alexey Bataev <a.bataev@outlook.com>
[SLP]Initial support for interleaved loads
Adds initial support for interleaved loads, which allows emission of segmented loads for RISCV RVV.
Vectorizes extra code for RISCV in CFP2006/447.dealII, CFP2006/453.povray,
CFP2017rate/510.parest_r, CFP2017rate/511.povray_r, CFP2017rate/526.blender_r,
CFP2017rate/538.imagick_r, CINT2006/403.gcc, CINT2006/473.astar,
CINT2017rate/502.gcc_r, CINT2017rate/525.x264_r
Reviewers: RKSimon, preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/112042
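For illustration (mine, not from the commit), a straight-line pattern whose loads form stride-2 interleave groups; with this support SLP can model them as a single segmented load on RISC-V V (e.g. a vlseg2-style access) rather than as independent scalar loads:
```
/* The adds SLP-vectorize; their operands are the even-indexed and the
 * odd-indexed elements of in[], i.e. two interleaved stride-2 load groups
 * that RVV can fetch with one segmented load. */
void sum_pairs(const int *in, int *restrict out) {
  out[0] = in[0] + in[1];
  out[1] = in[2] + in[3];
  out[2] = in[4] + in[5];
  out[3] = in[6] + in[7];
}
```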

76007138 | 12-Oct-2024 | Tim Renouf <tim.renouf@amd.com>
[LLVM] New NoDivergenceSource function attribute (#111832)
A call to a function that has this attribute is not a source of
divergence, as used by UniformityAnalysis. That allows a front-end to
use known-name calls as an instruction extension mechanism (e.g.
https://github.com/GPUOpen-Drivers/llvm-dialects) without such a call
being a source of divergence.

e34e27f1 | 11-Oct-2024 | Shilei Tian <i@tianshilei.me>
[TTI][AMDGPU] Allow targets to adjust `LastCallToStaticBonus` via `getInliningLastCallToStaticBonus` (#111311)
Currently we will not be able to inline a large function even if it only
has one live use because the inline cost is still very high after
applying `LastCallToStaticBonus`, which is a constant. This could
significantly impact performance because CSR spills are very
expensive.
This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to
allow targets to customize this value.
Fixes SWDEV-471398.

853c43d0 | 09-Oct-2024 | Jeffrey Byrnes <jeffrey.byrnes@amd.com>
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.

63a0a81e | 07-Oct-2024 | Farzon Lotfi <1802579+farzonl@users.noreply.github.com>
[NFC][Scalarizer][TargetTransformInfo] Add isTargetIntrinsicWithScalarOpAtArg api (#111441)
This change allows target intrinsics to have scalar args.
Fixes [111440](https://github.com/llvm/llvm-project/issues/111440).
This change will let us add scalarization for WaveReadLaneAt:
https://github.com/llvm/llvm-project/pull/111010

Revision tags: llvmorg-19.1.1

d2885743 | 25-Sep-2024 | Philip Reames <preames@rivosinc.com>
[TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824)
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
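As a concrete flavour of the problem (example mine, not from the patch): once vectorized, the ternary below becomes a select between two constant splats, and on RISC-V materializing an arbitrary 64-bit splat can cost several instructions while a zero splat is free, which is exactly the kind of difference the extended API lets the cost model see:
```
#include <stdint.h>

/* The loop vectorizes into a compare feeding a select whose arms are the
 * splat of a large 64-bit constant and the zero splat; their materialization
 * costs differ significantly on RISC-V. */
void classify(const int64_t *in, int64_t *restrict out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = (in[i] > 0) ? (int64_t)0x123456789ABCDEF0 : 0;
}
```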

0f97b482 | 17-Sep-2024 | Farzon Lotfi <1802579+farzonl@users.noreply.github.com>
[Scalarizer][DirectX] Add support for scalarization of Target intrinsics (#108776)
Since we are using the Scalarizer pass in the backend, we needed a way to
allow this pass to operate on target intrinsics.
We achieved this by adding `TargetTransformInfo` to the Scalarizer
pass. This allowed us to call a function available to the DirectX
backend to know if an intrinsic is a target intrinsic that should be
scalarized.

Revision tags: llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3

27a62ec7 | 18-Aug-2024 | Philip Reames <preames@rivosinc.com>
[LSR] Split the -lsr-term-fold transformation into its own pass (#104234)
This transformation doesn't actually use any of the internal state of
LSR and recomputes all information from SCEV. Splitting it out makes
it easier to test.
Note that long term I would like to write a version of this transform
which *is* integrated with LSR's solver, but if that happens, we'll
just delete the extra pass.
Integration-wise, I switched from using TTI to using a pass configuration
variable. This seems slightly more idiomatic, and means we don't run
the extra logic on any target other than RISCV.

bde24325 | 08-Aug-2024 | Jeremy Morse <jeremy.morse@sony.com>
Revert "[Asan] Provide TTI hook to provide memory reference infromation of target intrinsics. (#97070)"
This reverts commit e8ad87c7d06afe8f5dde2e4c7f13c314cb3a99e9. This reverts commit d3c9bb0cf811
Revert "[Asan] Provide TTI hook to provide memory reference infromation of target intrinsics. (#97070)"
This reverts commit e8ad87c7d06afe8f5dde2e4c7f13c314cb3a99e9. This reverts commit d3c9bb0cf811424dcb8c848cf06773dbdde19965.
A few buildbots trip up on asan-rvv-intrinsics.ll. I've also reverted the follow-up commit d3c9bb0cf8.
https://lab.llvm.org/buildbot/#/builders/46/builds/2895