==========================
OpenMP-Aware Optimizations
==========================

LLVM, since `version 11 <https://releases.llvm.org/download.html#11.0.0>`_ (12
Oct 2020), supports an :ref:`OpenMP-Aware optimization pass <OpenMPOpt>`. This
pass attempts to optimize the module using OpenMP-specific domain knowledge. It
is enabled by default at high optimization levels (O2 / O3) when compiling with
OpenMP support enabled.

.. _OpenMPOpt:

OpenMPOpt
=========

.. contents::
   :local:
   :depth: 1

OpenMPOpt contains several OpenMP-Aware optimizations. This pass is run early on
the entire Module, and later on the entire call graph. Most optimizations done
by OpenMPOpt support remarks. Optimization remarks can be enabled by compiling
with the following flags.

.. code-block:: console

  $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt

OpenMP Runtime Call Deduplication
---------------------------------

The OpenMP runtime library contains several functions used to implement features
of the OpenMP standard. Several of these runtime calls return the same value
throughout a parallel region. A common optimization is to replace repeated
invariant computations with a single one, but here the compiler only sees an
opaque call into the runtime library. To get around this, OpenMPOpt maintains a
list of OpenMP runtime functions that are constant and deduplicates calls to
them itself.
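
As a rough illustration of the kind of code this targets, the sketch below
queries the team size twice inside one parallel region. The function name
``count_consistent_threads`` and the fallback stub are hypothetical, and it is
an assumption that ``omp_get_num_threads`` is among the calls the pass treats
as constant.

.. code-block:: c++

  #ifdef _OPENMP
  #include <omp.h>
  #else
  // Stub so the sketch also builds without OpenMP enabled (assumption).
  static int omp_get_num_threads() { return 1; }
  #endif

  // Hypothetical example: counts the threads that observe the same team size
  // on both queries.
  int count_consistent_threads() {
    int consistent = 0;
  #pragma omp parallel reduction(+ : consistent)
    {
      int first = omp_get_num_threads();
      // Same value as `first` within this region, so this second call is
      // redundant and a candidate for deduplication.
      int second = omp_get_num_threads();
      consistent += (first == second) ? 1 : 0;
    }
    return consistent;
  }

Whether a particular call is actually removed depends on the compiler version
and optimization level; the optimization remarks are the way to check.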

Globalization
-------------

The OpenMP standard requires that data can be shared between different threads.
This requirement poses a unique challenge when offloading to GPU accelerators.
By default, data cannot be shared between the threads of a GPU; to be shared, it
must be placed in either global or shared memory. To generate correct OpenMP
programs, this must be done every time a variable may potentially be shared.
Unfortunately, this has significant performance implications and is not needed
in the majority of cases. For example, when Clang generates code for the
offloading region below, it sees that the variable ``x`` escapes and is
potentially shared. This requires globalizing the variable, which means it
cannot reside in registers on the device.

.. code-block:: c++

  void use(void *) { }

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

In many cases, this transformation is not actually necessary but still carries a
significant performance penalty. Because of this, OpenMPOpt can perform an
inter-procedural optimization that scans each known use of the globalized
variable and determines whether it is potentially captured and shared by another
thread. If it is not actually captured, it can safely be moved back to fast
register memory.
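
As a minimal sketch of a case where this analysis has a chance to succeed,
the variant below gives ``use`` internal linkage and a visible body that only
dereferences its argument. The helper ``run_on_device`` and the concrete
values are hypothetical, and it is an assumption that the pass can prove the
pointer is not captured here; the remarks will show whether it did.

.. code-block:: c++

  // Internal linkage and a visible body let the analysis see that the pointer
  // is only dereferenced, never stored or handed to another thread.
  static void use(int *p) { *p += 1; }

  static int foo() {
    int x = 41;
    use(&x); // x escapes syntactically, but is not captured by use
    return x;
  }

  // Hypothetical driver: runs foo in a target region and copies the result
  // back to the host.
  int run_on_device() {
    int result = 0;
  #pragma omp target map(tofrom : result)
    result = foo();
    return result;
  }

If ``use`` were instead only declared here and defined in another translation
unit, the compiler would have to assume the pointer may be captured, and ``x``
would stay globalized.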

Another case is memory that is intentionally shared between threads, but only
from one thread to all the others. Such variables can be moved to shared memory
at compile time without going through the runtime library. This allows users
to confidently declare shared memory on the device without needing to use custom
OpenMP allocators or rely on the runtime.

.. code-block:: c++

  static void share(void *);

  static void foo() {
    int x[64];
  #pragma omp parallel
    share(x);
  }

  int main() {
  #pragma omp target
    foo();
  }

These optimizations can have very large performance implications. Both rely
heavily on inter-procedural analysis. Because of this, offloading applications
should ideally be contained in a single translation unit, and functions should
not be externally visible unless needed. If remarks are enabled, OpenMPOpt will
inform the user of any globalization calls that remain. These should be treated
as a defect in the program.

Resources
=========

- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
- 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
- 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30