==========================
OpenMP-Aware Optimizations
==========================

LLVM, since `version 11 <https://releases.llvm.org/download.html#11.0.0>`_ (12
Oct 2020), supports an :ref:`OpenMP-Aware optimization pass <OpenMPOpt>`. This
optimization pass will attempt to optimize the module with OpenMP-specific
domain-knowledge. This pass is enabled by default at high optimization levels
(O2 / O3) if compiling with OpenMP support enabled.

.. _OpenMPOpt:

OpenMPOpt
=========

.. contents::
   :local:
   :depth: 1

OpenMPOpt contains several OpenMP-Aware optimizations. This pass is run early on
the entire Module, and later on the entire call graph. Most optimizations done
by OpenMPOpt support remarks. Optimization remarks can be enabled by compiling
with the following flags.

.. code-block:: console

  $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt

OpenMP Runtime Call Deduplication
---------------------------------

The OpenMP runtime library contains several functions used to implement features
of the OpenMP standard. Several of the runtime calls are constant within a
parallel region. A common optimization is to replace invariant code with a
single reference, but in this case the compiler will only see an opaque call
into the runtime library. To get around this, OpenMPOpt maintains a list of
OpenMP runtime functions that are constant and will manually deduplicate them.

Globalization
-------------

The OpenMP standard requires that data can be shared between different threads.
This requirement poses a unique challenge when offloading to GPU accelerators.
By default, data cannot be shared between the threads in a GPU. To be shared, it
must be placed in either global or shared memory. This needs to be done every
time a variable may potentially be shared in order to create correct OpenMP
programs. Unfortunately, this has significant performance implications and is
not needed in the majority of cases. For example, when Clang is generating code
for the offloading region below, it will see that the variable ``x`` escapes and
is potentially shared. This will require globalizing the variable, which means
it cannot reside in the registers on the device.

.. code-block:: c++

  void use(void *) { }

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

In many cases, this transformation is not actually necessary but still carries a
significant performance penalty. Because of this, OpenMPOpt can perform an
inter-procedural optimization that scans each known use of the globalized
variable and determines whether it is potentially captured and shared by another
thread. If it is not actually captured, it can safely be moved back to fast
register memory.

Another case is memory that is intentionally shared between the threads, but is
shared from one thread to all the others. Such variables can be moved to shared
memory at compile time without needing to go through the runtime library. This
allows users to confidently declare shared memory on the device without needing
to use custom OpenMP allocators or rely on the runtime.

.. code-block:: c++

  static void share(void *);

  static void foo() {
    int x[64];
  #pragma omp parallel
    share(x);
  }

  int main() {
  #pragma omp target
    foo();
  }

These optimizations can have very large performance implications. Both of these
optimizations rely heavily on inter-procedural analysis. Because of this,
offloading applications should ideally be contained in a single translation unit
and functions should not be externally visible unless needed. If remarks are
enabled, OpenMPOpt will inform the user of any globalization calls that remain.
This should be treated as a defect in the program.

Resources
=========

- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
- 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
- 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30
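
Illustrative Sketch: Runtime Call Deduplication
===============================================

The effect of the runtime call deduplication described in this document can be
pictured with a small, self-contained C++ sketch. It does not use the real
OpenMP runtime and does not show how OpenMPOpt is implemented.
``fake_thread_num`` is a hypothetical stand-in for a runtime query whose result
is constant within a parallel region, and ``runtime_calls`` is a hypothetical
counter added only to make the saved calls visible. ``region_before`` mirrors
code as a user might write it, and ``region_after`` is the same computation
deduplicated by hand.

.. code-block:: c++

  #include <cassert>

  // Hypothetical counter, used only to make the number of calls visible.
  static int runtime_calls = 0;

  // Hypothetical stand-in for an opaque runtime query (for example a
  // thread-id lookup) whose result does not change within one region.
  static int fake_thread_num() {
    ++runtime_calls;
    return 3; // Arbitrary fixed value for this sketch.
  }

  // As written by the user: three separate calls into the "runtime".
  static int region_before() {
    int offset = fake_thread_num() * 4;
    int next = fake_thread_num() + 1;
    return offset + next + fake_thread_num();
  }

  // After manual deduplication: one call, with the result reused.
  static int region_after() {
    const int tid = fake_thread_num();
    int offset = tid * 4;
    int next = tid + 1;
    return offset + next + tid;
  }

Both functions compute the same value. The counter records three calls for
``region_before`` and one for ``region_after``, which is the kind of saving the
pass aims for when it knows a runtime function is constant.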