#
b81f1e0d |
| 29-Mar-2016 |
Adam Nemet <anemet@apple.com> |
[PPC] Remove -ppc-loop-prefetch-distance in favor of -prefetch-distance
After the previous change, this can now be overridden centrally in the pass.
llvm-svn: 264807
|
#
fa7057a4 |
| 29-Mar-2016 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Refactor popcnt[dw] target features
Instead of using two feature bits, one to indicate the availability of the popcnt[dw] instructions, and another to indicate whether or not they're fast, use a single enum. This allows more consistent control via target attribute strings, and via Clang's command line.
llvm-svn: 264690
|
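A minimal standalone sketch of the single-enum scheme the commit above describes; the enum and function names here are illustrative stand-ins, not the actual PPC subtarget code.

#include <iostream>

// Hypothetical miniature of the refactoring: one enum value captures both
// whether popcnt[dw] exists and whether it is fast, replacing two feature bits.
enum class PopcntdKind { Unavailable, Slow, Fast };

// Only a fast hardware popcount beats the software expansion.
bool shouldUseHardwarePopcnt(PopcntdKind K) {
  return K == PopcntdKind::Fast;
}

int main() {
  std::cout << shouldUseHardwarePopcnt(PopcntdKind::Slow) << '\n';  // 0
  std::cout << shouldUseHardwarePopcnt(PopcntdKind::Fast) << '\n';  // 1
}

A single enum also composes naturally with target attribute strings, since one token can select any of the three states.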
#
69ada2f5 |
| 28-Mar-2016 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Clarify a comment in PPCTTI about vector loads
This should say that we could do unaligned vector loads on the P7 using VSX instructions, not that we should.
llvm-svn: 264683
|
#
7059d416 |
| 28-Mar-2016 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] On the A2, popcnt[dw] are very slow
The A2 cores support the popcntw/popcntd instructions, but they're microcoded, and slower than our default software emulation. Specifically, popcnt[dw] take approximately 74 cycles, whereas our software emulation takes only 24-28 cycles.
I've added a new target feature to indicate a slow popcnt[dw], instead of just removing the existing target feature from the a2/a2q processor models, because:
1. This allows us to return more accurate information via the TTI interface (I recognize that this currently makes no practical difference).
2. It is hopefully easier to understand (it allows the core's features to match its manual while still having the desired effect).
llvm-svn: 264600
|
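For context on the cycle comparison above, here is the classic branch-free SWAR popcount sequence; it is assumed to be representative of the kind of software expansion the message refers to, not the exact code LLVM emits for the A2.

#include <cstdint>
#include <iostream>

// Generic 64-bit software popcount: a short, branch-free sequence of shifts,
// masks, and adds, which is why it can beat a 74-cycle microcoded instruction.
uint64_t popcount64(uint64_t X) {
  X = X - ((X >> 1) & 0x5555555555555555ULL);                             // 2-bit partial sums
  X = (X & 0x3333333333333333ULL) + ((X >> 2) & 0x3333333333333333ULL);   // 4-bit partial sums
  X = (X + (X >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                             // per-byte sums
  return (X * 0x0101010101010101ULL) >> 56;                               // fold into the top byte
}

int main() {
  std::cout << popcount64(0xFFULL) << '\n';  // prints 8
}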
Revision tags: llvmorg-3.8.0, llvmorg-3.8.0-rc3, llvmorg-3.8.0-rc2 |
|
#
dadfbb52 |
| 27-Jan-2016 |
Adam Nemet <anemet@apple.com> |
[TTI] Add getPrefetchDistance from PPCLoopDataPrefetch, NFC
This patch is part of the work to make PPCLoopDataPrefetch target-independent (http://thread.gmane.org/gmane.comp.compilers.llvm.devel/92758).
As it was discussed in the above thread, getPrefetchDistance is currently using instruction count which may change in the future.
llvm-svn: 258995
|
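A hedged sketch of how an instruction-count prefetch distance can be converted into an "iterations ahead" count, in the spirit of the prefetching pass mentioned above; the function name and rounding policy are assumptions for illustration, not the pass's real code.

#include <algorithm>
#include <iostream>

// If the prefetch distance is expressed in instructions, dividing by the loop
// body size tells the pass how many iterations ahead to issue prefetches.
unsigned itersAhead(unsigned PrefetchDistanceInsts, unsigned LoopBodyInsts) {
  unsigned Iters = PrefetchDistanceInsts / std::max(1u, LoopBodyInsts);
  return std::max(1u, Iters);  // always prefetch at least one iteration ahead
}

int main() {
  // e.g. a 300-instruction prefetch distance over a 40-instruction loop body
  std::cout << itersAhead(300, 40) << '\n';  // 7 iterations ahead
}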
#
af761104 |
| 21-Jan-2016 |
Adam Nemet <anemet@apple.com> |
[TTI] Add getCacheLineSize
Summary: And use it in PPCLoopDataPrefetch.cpp.
@hfinkel, please let me know if your preference would be to preserve the ppc-loop-prefetch-cache-line option in order to be able to override the value of TTI::getCacheLineSize for PPC.
Reviewers: hfinkel
Subscribers: hulx2000, mcrosier, mssimpso, hfinkel, llvm-commits
Differential Revision: http://reviews.llvm.org/D16306
llvm-svn: 258419
|
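An illustrative use of a cache-line-size query like the one added above: consecutive accesses that fall in the same line need only one prefetch. The 128-byte line in the example is just a sample input value, not something this commit sets.

#include <cstdint>
#include <iostream>

// How many prefetch instructions a streaming access of `Bytes` bytes needs,
// given the target's cache line size (one prefetch per line touched).
unsigned prefetchesNeeded(uint64_t Bytes, unsigned CacheLineSize) {
  return static_cast<unsigned>((Bytes + CacheLineSize - 1) / CacheLineSize);
}

int main() {
  std::cout << prefetchesNeeded(1024, 128) << '\n';  // 8 prefetches
}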
Revision tags: llvmorg-3.8.0-rc1, llvmorg-3.7.1, llvmorg-3.7.1-rc2, llvmorg-3.7.1-rc1 |
|
#
4a7be239 |
| 04-Sep-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Enable interleaved-access vectorization
This adds a basic cost model for interleaved-access vectorization (and a better default for shuffles), and enables interleaved-access vectorization by default. The relevant difference from the default cost model for interleaved-access vectorization is that on PPC, the shuffles that end up being used are *much* cheaper than modeling the process with insert/extract pairs (which are quite expensive, especially on older cores).
llvm-svn: 246824
|
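To make "interleaved access" concrete, here is the shape of loop this kind of vectorization targets; the function is a made-up example, not from the commit. The vectorizer can load one wide contiguous block and de-interleave the even/odd lanes with shuffles instead of scalar insert/extract pairs.

#include <cstddef>

// Stride-2 (interleaved) accesses: real and imaginary parts alternate in
// memory, so a vectorizer loads a wide block and separates the lanes with
// shuffles rather than element-by-element extracts and inserts.
void scaleComplex(float *Data, float Scale, std::size_t N) {
  for (std::size_t I = 0; I != N; ++I) {
    Data[2 * I]     *= Scale;  // real part
    Data[2 * I + 1] *= Scale;  // imaginary part
  }
}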
#
75afa2b6 |
| 03-Sep-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Always use aggressive interleaving on the A2
On the A2, with an eye toward QPX unaligned-load merging, we should always use aggressive interleaving. It is generally superior to only using concatenation unrolling.
llvm-svn: 246819
|
#
f11bc761 |
| 03-Sep-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Include the permutation cost for unaligned vector loads
Pre-P8, when we generate code for unaligned vector loads (for Altivec and QPX types), even when accounting for the combining that takes place for multiple consecutive such loads, there is at least one load instruction and one permutation for each load. Make sure the cost reported reflects the cost of the permutes as well.
llvm-svn: 246807
|
#
79dbf5b5 |
| 02-Sep-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Cleanup cost model for unaligned vector loads/stores
I'm adding a regression test to better cover code generation for unaligned vector loads and stores, but there's no functional change to the code generation here. There is an improvement to the cost model for unaligned vector loads and stores, mostly for QPX (for which we were not previously accounting for the permutation-based loads), and the cost model implementation is cleaner.
llvm-svn: 246712
|
Revision tags: llvmorg-3.7.0, llvmorg-3.7.0-rc4, llvmorg-3.7.0-rc3, studio-1.4 |
|
#
93205eb9 |
| 05-Aug-2015 |
Chandler Carruth <chandlerc@gmail.com> |
[TTI] Make the cost APIs in TargetTransformInfo consistently use 'int' rather than 'unsigned' for their costs.
For something like costs in particular there is a natural "negative" value, that of savings or saved cost. As a consequence, there is a lot of code that subtracts or creates negative values based on cost, all of which is prone to awkwardness or bugs when dealing with an unsigned type. Similarly, we *never* want these values to wrap, as that would cause Very Bad code generation (likely perceived as an infinite loop as we try to emit over 2^32 instructions or some such insanity).
All around 'int' seems a much better fit for these basic metrics. I've added asserts to ensure that at least the TTI interface never returns negative numbers here. If we ever have a use case for negative numbers, we can remove this, but this way a bug where someone used '-1' to produce a 'very large' cost will be caught by the assert.
This passes all tests, and is also UBSan clean.
No functional change intended.
Differential Revision: http://reviews.llvm.org/D11741
llvm-svn: 244080
|
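A small self-contained illustration of the wrapping hazard described above: subtracting a saving from an unsigned cost wraps around to a huge value, while a signed cost simply goes negative, which is what the arithmetic actually means.

#include <iostream>

int main() {
  // With an unsigned cost, a "saving" that exceeds the cost wraps around
  // (to 4294967293 with the usual 32-bit unsigned int).
  unsigned UCost = 2;
  UCost -= 5;
  std::cout << UCost << '\n';

  // With a signed cost, the same subtraction just yields -3.
  int SCost = 2;
  SCost -= 5;
  std::cout << SCost << '\n';
}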
Revision tags: llvmorg-3.7.0-rc2, llvmorg-3.7.0-rc1 |
|
#
44ede33a |
| 09-Jul-2015 |
Mehdi Amini <mehdi.amini@apple.com> |
Make TargetLowering::getPointerTy() take DataLayout as an argument
Summary: This change is part of a series of commits dedicated to having a single DataLayout during compilation, by always using the one owned by the module.
Reviewers: echristo
Subscribers: jholewinski, ted, yaron.keren, rafael, llvm-commits
Differential Revision: http://reviews.llvm.org/D11028
From: Mehdi Amini <mehdi.amini@apple.com> llvm-svn: 241775
|
Revision tags: llvmorg-3.6.2, llvmorg-3.6.2-rc1 |
|
#
3b3c9c3e |
| 21-May-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PPC/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the PPC/A2
On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.
llvm-svn: 237947
|
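A rough, made-up C++ illustration of the trade-off being weighed above: when the loop's step or bounds are only known at run time, runtime unrolling must first compute the trip count, which can require an integer divide in the preheader, and that divide is the "high-cost trip count computation" referred to.

#include <cstddef>

// A loop whose step is only known at run time.  To runtime-unroll it, the
// compiler needs the trip count, roughly ceil(N / Step), i.e. a real divide
// before the loop even starts.  On cores that gain little from unrolling
// (typical OOO x86), the divide can outweigh the benefit; on the A2 the
// unrolled body wins by enough to pay for it.
void stridedAdd(float *Y, const float *X, std::size_t N, std::size_t Step) {
  for (std::size_t I = 0; I < N; I += Step)
    Y[I] += X[I];
}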
Revision tags: llvmorg-3.6.1, llvmorg-3.6.1-rc1 |
|
#
062c7448 |
| 06-May-2015 |
Wei Mi <wmi@google.com> |
[X86] Disable loop unrolling in loop vectorization pass when VF is 1.
The patch disables unrolling in the loop vectorization pass when VF==1 on the x86 architecture, by setting MaxInterleaveFactor to 1. Unrolling in the loop vectorization pass may introduce the cost of overflow checks, memory boundary checks, and extra prologue/epilogue code when the regular unroller will unroll the loop another time. Disabling it when VF==1 removes this unnecessary cost on x86. The same can be done for other platforms after verifying that interleaving/memory bound checking is not performance-critical on those platforms.
Differential Revision: http://reviews.llvm.org/D9515
llvm-svn: 236613
show more ...
|
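A standalone sketch of the policy described above, not the actual X86 TTI implementation; the function name, parameters, and the per-core default are assumptions for illustration.

// When the vectorization factor is 1 (a plain scalar loop), report an
// interleave factor of 1 so the vectorizer leaves any unrolling to the
// regular unroller; otherwise use the target's usual default.
unsigned getMaxInterleaveFactorSketch(unsigned VF, unsigned CoreDefault) {
  if (VF == 1)
    return 1;
  return CoreDefault;
}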
Revision tags: llvmorg-3.5.2, llvmorg-3.5.2-rc1 |
|
#
049d803c |
| 06-Mar-2015 |
Olivier Sallenave <ohsallen@us.ibm.com> |
Do not restrict interleaved unrolling to small loops, depending on the target.
llvm-svn: 231528
|
Revision tags: llvmorg-3.6.0 |
|
#
c93a9a2c |
| 25-Feb-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Add support for the QPX vector instruction set
This adds support for the QPX vector instruction set, which is used by the enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bits wide, holding 4 double-precision floating-point values. Boolean values, modeled here as <4 x i1>, are actually also represented as floating-point values (essentially { -1, 1 } for { false, true }). QPX shares many features with Altivec and VSX, but is distinct from both of them. One major difference is that, instead of adding completely-separate vector registers, QPX vector registers are extensions of the scalar floating-point registers (lane 0 is the corresponding scalar floating-point value). The operations supported on QPX vectors mirror those supported on the scalar floating-point values (with some additional ones for permutations and logical/comparison operations).
I've been maintaining this support out-of-tree, as part of the bgclang project, for several years. This is not the entire bgclang patch set, but is most of the subset that can be cleanly integrated into LLVM proper at this time. Adding this to the LLVM backend is part of my efforts to rebase bgclang to the current LLVM trunk, but is independently useful (especially for codes that use LLVM as a JIT in library form).
The assembler/disassembler test coverage is complete. The CodeGen test coverage is not, but I've included some tests, and more will be added as follow-up work.
llvm-svn: 230413
|
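To make the boolean convention above concrete, here is a scalar analogue of one select lane, assuming the { -1.0, 1.0 } encoding for { false, true } described in the message; this is purely illustrative and not how the backend actually lowers selects.

// One lane of a QPX-style select: the predicate is carried as a floating-point
// value (negative means false, positive means true), so the "boolean" never
// leaves the floating-point register file.
double qpxStyleSelect(double Pred, double IfTrue, double IfFalse) {
  return Pred > 0.0 ? IfTrue : IfFalse;
}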
Revision tags: llvmorg-3.6.0-rc4, llvmorg-3.6.0-rc3 |
|
#
05e69157 |
| 12-Feb-2015 |
Olivier Sallenave <ohsallen@us.ibm.com> |
Change max interleave factor to 12 for POWER7 and POWER8.
llvm-svn: 228973
|
#
ab5cb36c |
| 01-Feb-2015 |
Chandler Carruth <chandlerc@gmail.com> |
[multiversion] Remove the function parameter from the unrolling preferences interface on TTI now that all of TTI is per-function.
llvm-svn: 227741
|
#
c956ab66 |
| 01-Feb-2015 |
Chandler Carruth <chandlerc@gmail.com> |
[multiversion] Switch the TTI queries from TargetMachine to Subtarget now that we have a correct and cached subtarget specific to the function.
Also, finish providing a cached per-function subtarget in the core LLVMTargetMachine -- that layer hadn't switched over yet.
The only use of the TargetMachine was to re-lookup a subtarget for a particular function to work around the fact that TTI was immutable. Now that it is per-function and we have a cached subtarget, use it.
This still leaves a few interfaces with real warts on them where we were passing Function objects through the TTI interface. I'll remove these and clean their usage up in subsequent commits now that this isn't necessary.
llvm-svn: 227738
|
#
93dcdc47 |
| 31-Jan-2015 |
Chandler Carruth <chandlerc@gmail.com> |
[PM] Switch the TargetMachine interface from accepting a pass manager base to which it adds a single analysis pass, to instead returning the type-erased TargetTransformInfo object constructed for that TargetMachine.
This removes all of the pass variants for TTI. There is now a single TTI *pass* in the Analysis layer. All of the Analysis <-> Target communication is through the TTI's type-erased interface itself. While the diff is large here, it is nothing more than code motion to make types available in a header file for use in a different source file within each target.
I've tried to keep all the doxygen comments and file boilerplate in line with this move, but let me know if I missed anything.
With this in place, the next step to making TTI work with the new pass manager is to introduce a really simple new-style analysis that produces a TTI object via a callback into this routine on the target machine. Once we have that, we'll have the building blocks necessary to accept a function argument as well.
llvm-svn: 227685
|
#
705b185f |
| 31-Jan-2015 |
Chandler Carruth <chandlerc@gmail.com> |
[PM] Change the core design of the TTI analysis to use a polymorphic type erased interface and a single analysis pass rather than an extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased implementation that supports the polymorphic TTI interface. We can build one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes, including CRTP base classes to facilitate calling back up to the most specialized form when delegating horizontally across the surface. These aren't as clean as I would like and I'm planning to work on cleaning some of this up, but I wanted to start by putting it into the right form.
There are a number of reasons for this change and this particular design. The first and foremost is that an analysis group is complete overkill, and the chaining delegation strategy was so opaque, confusing, and high-overhead that TTI was suffering greatly for it. Several of the TTI functions had failed to be implemented in all places because the chaining-based delegation provided no checking for this. A few other functions were implemented with incorrect delegation. The message to me while working on this was very clear: the delegation and analysis group structure was too confusing to be useful here.
The other reason, of course, is that this is a *much* more natural fit for the new pass manager. This will lay the groundwork for a type-erased per-function info object that can look up the correct subtarget and even cache it.
Yet another benefit is that this will significantly simplify the interaction of the pass managers and the TargetMachine. See the future work below.
The downside of this change is that it is very, very verbose. I'm going to work to improve that, but it is somewhat an implementation necessity in C++ to do type erasure. =/ I discussed this design really extensively with Eric and Hal prior to going down this path, and afterward showed them the result. No one was really thrilled with it, but there doesn't seem to be a substantially better alternative. Using a base class and virtual method dispatch would make the code much shorter, but as discussed in the update to the programmer's manual and elsewhere, a polymorphic interface feels like the more principled approach even if this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the huge chunk that I couldn't really split things out of because this was the interface change to TTI. I've tried to minimize all the other parts of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return a TTI object. Because we have a non-pass object with value semantics and an internal type erasure mechanism, we can narrow the interface of the TargetMachine to *just* do what we need: build and return a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function. This will include splitting off a minimal form of it which is sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the target machine for each function. This may actually be done as part of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is just a bit messy and exacerbating the complexity of implementing the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on this somewhat more abruptly than I expected, and so I appreciate getting it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
|
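The "type erased interface" described above follows the familiar concept/model idiom; here it is reduced to a minimal standalone example. The class names and the toy getUserCost query are invented for illustration and do not mirror LLVM's actual TargetTransformInfo classes.

#include <iostream>
#include <memory>
#include <utility>

// A value-semantic wrapper that can hold any target-specific implementation
// behind one non-virtual public interface: the "type erasure" in miniature.
class CostModel {
  struct Concept {
    virtual ~Concept() = default;
    virtual int getUserCost(int Opcode) const = 0;
  };
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T I) : Impl(std::move(I)) {}
    int getUserCost(int Opcode) const override { return Impl.getUserCost(Opcode); }
  };
  std::unique_ptr<Concept> Impl;

public:
  template <typename T>
  explicit CostModel(T ImplT)
      : Impl(std::make_unique<Model<T>>(std::move(ImplT))) {}
  int getUserCost(int Opcode) const { return Impl->getUserCost(Opcode); }
};

// A "target-specific" implementation; note it inherits from nothing public.
struct MyTargetCosts {
  int getUserCost(int Opcode) const { return Opcode == 0 ? 1 : 4; }
};

int main() {
  CostModel CM{MyTargetCosts{}};
  std::cout << CM.getUserCost(0) << '\n';  // 1, answered by the wrapped impl
}

The appeal of this shape over a virtual base class is that implementations stay plain value types, which is what allows a single analysis pass to construct and hand out the object per function.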
Revision tags: llvmorg-3.6.0-rc2, llvmorg-3.6.0-rc1 |
|
#
934361a4 |
| 14-Jan-2015 |
Hal Finkel <hfinkel@anl.gov> |
Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support""
This re-applies r225808, fixed to avoid problems with SDAG dependencies along with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs. These problems caused the original regression tests to assert/segfault on many (but not all) systems.
Original commit message:
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call lowering (this lets us take advantage of this common code for lowering some common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is very similar to the support in the AArch64 target, with the obvious differences (different registers, NOP instructions, etc.). The test cases are adapted from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so you can't use less than that (on AArch64 you can go down to 16). Also, as noted in the docs, we take the patchpoint address to be the actual code address (assuming the call is local in the TOC-sharing sense), which should yield higher performance than generating the full cross-DSO indirect-call sequence and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support is doubly experimental. So go ahead and experiment!
llvm-svn: 225909
|
#
63fb9281 |
| 13-Jan-2015 |
Hal Finkel <hfinkel@anl.gov> |
Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support"
Reverting this while I investigate buildbot failures (segfaulting in GetCostForDef at ScheduleDAGRRList.cpp:314).
llvm-svn: 225811
|
#
821befd5 |
| 13-Jan-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Add StackMap/PatchPoint support
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call lowering (this lets us take advantage of this common code for lowering some common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is very similar to the support in the AArch64 target, with the obvious differences (different registers, NOP instructions, etc.). The test cases are adapted from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so you can't use less than that (on AArch64 you can go down to 16). Also, as noted in the docs, we take the patchpoint address to be the actual code address (assuming the call is local in the TOC-sharing sense), which should yield higher performance than generating the full cross-DSO indirect-call sequence and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support is doubly experimental. So go ahead and experiment!
llvm-svn: 225808
|
#
b359b735 |
| 09-Jan-2015 |
Hal Finkel <hfinkel@anl.gov> |
[PowerPC] Enable late partial unrolling on the POWER7
The P7 benefits from not having really-small loops, so that we either have multiple dispatch groups in the loop and/or the ability to form fuller dispatch groups during scheduling. Setting the partial unrolling threshold to 44 seems good, empirically, for the P7. Compared to using no late partial unrolling, this yields the following test-suite speedups:
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -66.3253% +/- 24.1975%
SingleSource/Benchmarks/Misc-C++/oopack_v1p8 -44.0169% +/- 29.4881%
SingleSource/Benchmarks/Misc/pi -27.8351% +/- 12.2712%
SingleSource/Benchmarks/Stanford/Bubblesort -30.9898% +/- 22.4647%
I've speculatively added a similar setting for the P8. Also, I've noticed that the unroller does not quite calculate the unrolling factor correctly for really tiny loops because it neglects to account for the fact that not every loop body replicant contains an ending branch and counter increment. I'll fix that later.
llvm-svn: 225522
|