History log of /llvm-project/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp (Results 176 – 200 of 227)
Revision Date Author Comments
# b81f1e0d 29-Mar-2016 Adam Nemet <anemet@apple.com>

[PPC] Remove -ppc-loop-prefetch-distance in favor of -prefetch-distance

After the previous change, this can now be overridden centrally in the
pass.

llvm-svn: 264807


# fa7057a4 29-Mar-2016 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Refactor popcnt[dw] target features

Instead of using two feature bits, one to indicate the availability of the
popcnt[dw] instructions, and another to indicate whether or not they're fast,
use a single enum. This allows more consistent control via target attribute
strings, and via Clang's command line.

llvm-svn: 264690
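
As a rough sketch of the shape this refactoring takes (the enumerator names
below are illustrative assumptions, not necessarily the ones used in the
tree), a single enum can carry both availability and speed:

  // Illustrative only: one enum in the subtarget replaces the two earlier
  // feature bits ("popcnt[dw] available" and "popcnt[dw] fast").
  enum POPCNTDKind {
    POPCNTD_Unavailable, // instructions not present
    POPCNTD_Slow,        // present but microcoded/slow (e.g. the A2)
    POPCNTD_Fast         // present and fast
  };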



# 69ada2f5 28-Mar-2016 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Clarify a comment in PPCTTI about vector loads

This should say that we could do unaligned vector loads on the P7 using VSX
instructions, not that we should.

llvm-svn: 264683


# 7059d416 28-Mar-2016 Hal Finkel <hfinkel@anl.gov>

[PowerPC] On the A2, popcnt[dw] are very slow

The A2 cores support the popcntw/popcntd instructions, but they're microcoded,
and slower than our default software emulation. Specifically, popcnt[dw] take
approximately 74 cycles, whereas our software emulation takes only 24-28
cycles.

I've added a new target feature to indicate a slow popcnt[dw], instead of just
removing the existing target feature from the a2/a2q processor models, because:
1. This allows us to return more accurate information via the TTI interface
(I recognize that this currently makes no practical difference)
2. It is hopefully easier to understand (it allows the core's features to match
its manual while still having the desired effect).

llvm-svn: 264600
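
As a hedged sketch of what point 1 above could look like through TTI (the
subtarget predicate names here are assumptions, not the committed code), a
slow-popcnt feature lets the hook distinguish slow hardware from fast:

  // Sketch only; ST is the PPC subtarget.
  TargetTransformInfo::PopcntSupportKind
  PPCTTIImpl::getPopcntSupport(unsigned TyWidth) {
    assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
    if (ST->hasPOPCNTD() && TyWidth <= 64)
      return ST->isPOPCNTDSlow() ? TTI::PSK_SlowHardware  // e.g. the A2
                                 : TTI::PSK_FastHardware;
    return TTI::PSK_Software;
  }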



Revision tags: llvmorg-3.8.0, llvmorg-3.8.0-rc3, llvmorg-3.8.0-rc2
# dadfbb52 27-Jan-2016 Adam Nemet <anemet@apple.com>

[TTI] Add getPrefetchDistance from PPCLoopDataPrefetch, NFC

This patch is part of the work to make PPCLoopDataPrefetch
target-independent
(http://thread.gmane.org/gmane.comp.compilers.llvm.devel/92758).

As discussed in the above thread, getPrefetchDistance currently uses
an instruction count, which may change in the future.

llvm-svn: 258995
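
A minimal sketch of the new hook, assuming a PPC-style override (the distance
value is a placeholder, and, as noted above, it is expressed in instructions
for now):

  // Hypothetical override in a target's TTI implementation.
  unsigned PPCTTIImpl::getPrefetchDistance() const {
    // Placeholder: how far ahead (in instructions, currently) to prefetch.
    return 300;
  }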



# af761104 21-Jan-2016 Adam Nemet <anemet@apple.com>

[TTI] Add getCacheLineSize

Summary:
And use it in PPCLoopDataPrefetch.cpp.

@hfinkel, please let me know if your preference would be to preserve the
ppc-loop-prefetch-cache-line option in order to be able to override the
value of TTI::getCacheLineSize for PPC.

Reviewers: hfinkel

Subscribers: hulx2000, mcrosier, mssimpso, hfinkel, llvm-commits

Differential Revision: http://reviews.llvm.org/D16306

llvm-svn: 258419
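
A hedged sketch of the override, with a placeholder value (the actual PPC
number may differ or be subtarget-dependent):

  // Sketch only: targets override the generic TTI default.
  unsigned PPCTTIImpl::getCacheLineSize() const {
    return 128; // assumed cache-line size for illustration
  }

The prefetching pass can then query TTI for the line size instead of reading
its own ppc-loop-prefetch-cache-line option, unless that override is kept as
discussed in the summary.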



Revision tags: llvmorg-3.8.0-rc1, llvmorg-3.7.1, llvmorg-3.7.1-rc2, llvmorg-3.7.1-rc1
# 4a7be239 04-Sep-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Enable interleaved-access vectorization

This adds a basic cost model for interleaved-access vectorization (and a better
default for shuffles), and enables interleaved-access vectorization by default.
The relevant difference from the default cost model for interleaved-access
vectorization is that, on PPC, the shuffles that end up being used are *much*
cheaper than modeling the process with insert/extract pairs (which are
quite expensive, especially on older cores).

llvm-svn: 246824
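
Purely as an illustration of the modeling choice described above (the
signature follows the TTI hook of that era, and the constants are
assumptions, not the committed code), the interleaved-access cost can be one
wide memory operation plus one cheap permute per member, rather than
per-element insert/extract pairs:

  int PPCTTIImpl::getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
                                             unsigned Factor,
                                             ArrayRef<unsigned> Indices,
                                             unsigned Alignment,
                                             unsigned AddressSpace) {
    // Cost of the wide load/store itself (including any unalignment penalty).
    int Cost = getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace);
    // One vperm-style shuffle per member of the group, assumed to cost 1
    // each, instead of modeling the deinterleave with insert/extract pairs.
    return Cost + Factor;
  }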



# 75afa2b6 03-Sep-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Always use aggressive interleaving on the A2

On the A2, with an eye toward QPX unaligned-load merging, we should always use
aggressive interleaving. It is generally superior to only using concatenation
unrolling.

llvm-svn: 246819
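
A sketch of the hook this commit flips; how the A2 is identified here is an
assumption, not taken from the commit:

  bool PPCTTIImpl::enableAggressiveInterleaving(bool LoopHasReductions) {
    // Always interleave aggressively on the A2 (with an eye toward QPX
    // unaligned-load merging); otherwise keep the reduction-only behavior.
    if (ST->getDarwinDirective() == PPC::DIR_A2)
      return true;
    return LoopHasReductions;
  }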



# f11bc761 03-Sep-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Include the permutation cost for unaligned vector loads

Pre-P8, when we generate code for unaligned vector loads (for Altivec and QPX
types), even when accounting for the combining that takes place for multiple
consecutive such loads, there is at least one load instruction and one
permutation for each load. Make sure the cost reported reflects the cost of the
permutes as well.

llvm-svn: 246807
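
An illustration of the accounting change, not the committed code (the
alignment test and the +1 permute cost are assumptions):

  int PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
                                  unsigned Alignment, unsigned AddressSpace) {
    int Cost = BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
    unsigned SrcBytes = Src->getPrimitiveSizeInBits() / 8;
    bool NaturallyAligned = SrcBytes == 0 || Alignment >= SrcBytes;
    // Pre-P8 Altivec/QPX unaligned vector loads need at least one permute
    // (e.g. lvsl/vperm) per load, even after consecutive loads are combined.
    if (Src->isVectorTy() && Opcode == Instruction::Load && !NaturallyAligned)
      Cost += 1;
    return Cost;
  }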



# 79dbf5b5 02-Sep-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Cleanup cost model for unaligned vector loads/stores

I'm adding a regression test to better cover code generation for unaligned
vector loads and stores, but there's no functional change to the code
generation here. There is an improvement to the cost model for unaligned vector
loads and stores, mostly for QPX (for which we were not previously accounting
for the permutation-based loads), and the cost model implementation is cleaner.

llvm-svn: 246712



Revision tags: llvmorg-3.7.0, llvmorg-3.7.0-rc4, llvmorg-3.7.0-rc3, studio-1.4
# 93205eb9 05-Aug-2015 Chandler Carruth <chandlerc@gmail.com>

[TTI] Make the cost APIs in TargetTransformInfo consistently use 'int'
rather than 'unsigned' for their costs.

For something like costs in particular there is a natural "negative"
value, that of savings or saved cost. As a consequence, there is a lot
of code that subtracts or creates negative values based on cost, all of
which is prone to awkwardness or bugs when dealing with an unsigned
type. Similarly, we *never* want these values to wrap, as that would
cause Very Bad code generation (likely perceived as an infinite loop as
we try to emit over 2^32 instructions or some such insanity).

All around 'int' seems a much better fit for these basic metrics. I've
added asserts to ensure that at least the TTI interface never returns
negative numbers here. If we ever have a use case for negative numbers,
we can remove this, but this way a bug where someone used '-1' to
produce a 'very large' cost will be caught by the assert.

This passes all tests, and is also UBSan clean.

No functional change intended.

Differential Revision: http://reviews.llvm.org/D11741

llvm-svn: 244080
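
A minimal sketch of the invariant being added at the TTI boundary; the
particular wrapped hook shown is just an example:

  int TargetTransformInfo::getUserCost(const User *U) const {
    int Cost = TTIImpl->getUserCost(U);
    // Costs are signed so savings can be expressed, but the interface itself
    // should never hand back a negative (or wrapped) value.
    assert(Cost >= 0 && "TTI should not produce negative costs!");
    return Cost;
  }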



Revision tags: llvmorg-3.7.0-rc2, llvmorg-3.7.0-rc1
# 44ede33a 09-Jul-2015 Mehdi Amini <mehdi.amini@apple.com>

Make TargetLowering::getPointerTy() take DataLayout as an argument

Summary:
This change is part of a series of commits dedicated to having a single
DataLayout during compilation by always using the one owned by the
module.

Reviewers: echristo

Subscribers: jholewinski, ted, yaron.keren, rafael, llvm-commits

Differential Revision: http://reviews.llvm.org/D11028

From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 241775
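
A hedged example of the call-site change this implies (the surrounding
variables F and TLI are assumptions and must already be in scope):

  const DataLayout &DL = F.getParent()->getDataLayout(); // the module's DL
  EVT PtrVT = TLI->getPointerTy(DL);   // new form: DataLayout passed in
  // EVT PtrVT = TLI->getPointerTy();  // old form, no longer available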



Revision tags: llvmorg-3.6.2, llvmorg-3.6.2-rc1
# 3b3c9c3e 21-May-2015 Hal Finkel <hfinkel@anl.gov>

[PPC/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the PPC/A2

On X86 (and similar OOO cores) unrolling is very limited, and even if the
runtime unrolling is otherwise profitable, the expense of a division to compute
the trip count could greatly outweigh the benefits. On the A2, we unroll a lot,
and the benefits of unrolling are more significant (seeing a 5x or 6x speedup
is not uncommon), so we're more able to tolerate the expense, on average, of a
division to compute the trip count.

llvm-svn: 237947



Revision tags: llvmorg-3.6.1, llvmorg-3.6.1-rc1
# 062c7448 06-May-2015 Wei Mi <wmi@google.com>

[X86] Disable loop unrolling in loop vectorization pass when VF is 1.

The patch disabled unrolling in the loop vectorization pass when VF==1 on the x86
architecture, by setting MaxInterleaveFactor to 1. Unrolling in the loop vectorization
pass may introduce the cost of an overflow check, memory boundary checks, and extra
prologue/epilogue code when the regular unroller unrolls the loop another time.
Disabling it when VF==1 removes that unnecessary cost on x86. The same can be done for
other platforms after verifying that interleaving and memory-bound checking are not
performance critical on those platforms.

Differential Revision: http://reviews.llvm.org/D9515

llvm-svn: 236613
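
A sketch of the shape of the change; the non-1 factor below is a placeholder,
not the value the X86 backend actually returns:

  unsigned X86TTIImpl::getMaxInterleaveFactor(unsigned VF) {
    // Interleaving (unrolling inside the vectorizer) adds overflow and
    // memory-bound checks; when VF == 1, leave the loop to the regular
    // unroller instead.
    if (VF == 1)
      return 1;
    return 4; // placeholder for the usual target-specific factor
  }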



Revision tags: llvmorg-3.5.2, llvmorg-3.5.2-rc1
# 049d803c 06-Mar-2015 Olivier Sallenave <ohsallen@us.ibm.com>

Do not restrict interleaved unrolling to small loops, depending on the target.

llvm-svn: 231528


Revision tags: llvmorg-3.6.0
# c93a9a2c 25-Feb-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Add support for the QPX vector instruction set

This adds support for the QPX vector instruction set, which is used by the
enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bits
wide, holding 4 double-precision floating-point values. Boolean values, modeled
here as <4 x i1>, are actually also represented as floating-point values
(essentially { -1, 1 } for { false, true }). QPX shares many features with
Altivec and VSX, but is distinct from both of them. One major difference is
that, instead of adding completely-separate vector registers, QPX vector
registers are extensions of the scalar floating-point registers (lane 0 is the
corresponding scalar floating-point value). The operations supported on QPX
vectors mirror those supported on the scalar floating-point values (with some
additional ones for permutations and logical/comparison operations).

I've been maintaining this support out-of-tree, as part of the bgclang project,
for several years. This is not the entire bgclang patch set, but is most of the
subset that can be cleanly integrated into LLVM proper at this time. Adding
this to the LLVM backend is part of my efforts to rebase bgclang to the current
LLVM trunk, but is independently useful (especially for codes that use LLVM as
a JIT in library form).

The assembler/disassembler test coverage is complete. The CodeGen test coverage
is not, but I've included some tests, and more will be added as follow-up work.

llvm-svn: 230413



Revision tags: llvmorg-3.6.0-rc4, llvmorg-3.6.0-rc3
# 05e69157 12-Feb-2015 Olivier Sallenave <ohsallen@us.ibm.com>

Change max interleave factor to 12 for POWER7 and POWER8.

llvm-svn: 228973


# ab5cb36c 01-Feb-2015 Chandler Carruth <chandlerc@gmail.com>

[multiversion] Remove the function parameter from the unrolling
preferences interface on TTI now that all of TTI is per-function.

llvm-svn: 227741


# c956ab66 01-Feb-2015 Chandler Carruth <chandlerc@gmail.com>

[multiversion] Switch the TTI queries from TargetMachine to Subtarget
now that we have a correct and cached subtarget specific to the
function.

Also, finish providing a cached per-function subtarget in the core
LLVMTargetMachine -- that layer hadn't switched over yet.

The only use of the TargetMachine was to re-lookup a subtarget for
a particular function to work around the fact that TTI was immutable.
Now that it is per-function and we have a cached subtarget, use it.

This still leaves a few interfaces with real warts on them where we were
passing Function objects through the TTI interface. I'll remove these
and clean their usage up in subsequent commits now that this isn't
necessary.

llvm-svn: 227738



# 93dcdc47 31-Jan-2015 Chandler Carruth <chandlerc@gmail.com>

[PM] Switch the TargetMachine interface from accepting a pass manager
base which it adds a single analysis pass to, to instead return the type
erased TargetTransformInfo object constructed for that TargetMachine.

This removes all of the pass variants for TTI. There is now a single TTI
*pass* in the Analysis layer. All of the Analysis <-> Target
communication is through the TTI's type erased interface itself. While
the diff is large here, it is nothing more than code motion to make
types available in a header file for use in a different source file
within each target.

I've tried to keep all the doxygen comments and file boilerplate in line
with this move, but let me know if I missed anything.

With this in place, the next step to making TTI work with the new pass
manager is to introduce a really simple new-style analysis that produces
a TTI object via a callback into this routine on the target machine.
Once we have that, we'll have the building blocks necessary to accept
a function argument as well.

llvm-svn: 227685



# 705b185f 31-Jan-2015 Chandler Carruth <chandlerc@gmail.com>

[PM] Change the core design of the TTI analysis to use a polymorphic
type erased interface and a single analysis pass rather than an
extremely complex analysis group.

The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.

I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting it into the right form.

There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because the chaining-based delegation provided no checking for this. A few
other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.

The other reason, of course, is that this is a *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.

Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.

The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat of an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]

Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:

1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.

Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.

Differential Revision: http://reviews.llvm.org/D7293

llvm-svn: 227669
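
For readers unfamiliar with the CRTP delegation mentioned above, here is a
minimal self-contained illustration (not LLVM's actual classes) of how a base
mix-in calls back up to the most specialized form:

  #include <iostream>

  template <typename T> struct TTIImplCRTPBase {
    // Default helper; casting to the derived type means a target's override
    // of getCacheLineSize() is picked up here too.
    unsigned getPrefetchBytes() {
      return 4 * static_cast<T *>(this)->getCacheLineSize();
    }
    unsigned getCacheLineSize() { return 64; } // generic fallback
  };

  struct MyTargetTTI : TTIImplCRTPBase<MyTargetTTI> {
    unsigned getCacheLineSize() { return 128; } // target-specific override
  };

  int main() {
    MyTargetTTI TTI;
    std::cout << TTI.getPrefetchBytes() << "\n"; // prints 512, not 256
  }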



Revision tags: llvmorg-3.6.0-rc2, llvmorg-3.6.0-rc1
# 934361a4 14-Jan-2015 Hal Finkel <hfinkel@anl.gov>

Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support""

This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDN

Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support""

This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs.
These problems caused the original regression tests to assert/segfault on many
(but not all) systems.

Original commit message:

This commit does two things:

1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).

2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.

One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).

StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!

llvm-svn: 225909



# 63fb9281 13-Jan-2015 Hal Finkel <hfinkel@anl.gov>

Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support"

Reverting this while I investigate buildbot failures (segfaulting in
GetCostForDef at ScheduleDAGRRList.cpp:314).

llvm-svn: 225811


# 821befd5 13-Jan-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Add StackMap/PatchPoint support

This commit does two things:

1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).

2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.

One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).

StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!

llvm-svn: 225808



# b359b735 09-Jan-2015 Hal Finkel <hfinkel@anl.gov>

[PowerPC] Enable late partial unrolling on the POWER7

The P7 benefits from not having really-small loops so that we either have
multiple dispatch groups in the loop and/or the ability to form more-full
dispatch groups during scheduling. Setting the partial unrolling threshold to
44 seems good, empirically, for the P7. Compared to using no late partial
unrolling, this yields the following test-suite speedups:

SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding
-66.3253% +/- 24.1975%
SingleSource/Benchmarks/Misc-C++/oopack_v1p8
-44.0169% +/- 29.4881%
SingleSource/Benchmarks/Misc/pi
-27.8351% +/- 12.2712%
SingleSource/Benchmarks/Stanford/Bubblesort
-30.9898% +/- 22.4647%

I've speculatively added a similar setting for the P8. Also, I've noticed that
the unroller does not quite calculate the unrolling factor correctly for really
tiny loops because it neglects to account for the fact that not every loop body
replicant contains an ending branch and counter increment. I'll fix that later.

llvm-svn: 225522
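
Purely illustrative, since the message above does not show where the
threshold actually lives: one way a TTI hook could express this is

  void PPCTTIImpl::getUnrollingPreferences(Loop *L,
                                           TTI::UnrollingPreferences &UP) {
    UP.Partial = UP.Runtime = true;
    // Empirically chosen partial-unrolling threshold for the P7, applied
    // speculatively to the P8 as well; directive names are assumptions.
    if (ST->getDarwinDirective() == PPC::DIR_PWR7 ||
        ST->getDarwinDirective() == PPC::DIR_PWR8)
      UP.PartialThreshold = 44;
  }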


