.. _omp120:

Transformed generic-mode kernel to SPMD-mode [OMP120]
=====================================================

This optimization remark indicates that the execution strategy for the OpenMP
target offloading kernel was changed. Generic-mode kernels are executed by a
single thread that schedules parallel worker threads using a state machine.
This code transformation can move a kernel that was initially generated in
generic-mode to SPMD-mode, where all threads are active at the same time and no
state machine is needed. This execution strategy is closer to how the threads
are actually executed on a GPU target. The transformation is only possible if
the instructions previously executed by a single thread have no side effects or
can be guarded. If the instructions have no side effects, they are simply
recomputed by each thread.

Generic-mode is often considerably slower than SPMD-mode because of the extra
overhead required to separately schedule worker threads and pass data between
them. This optimization allows users to use generic-mode semantics while
achieving the performance of SPMD-mode. This can be helpful when defining
shared memory between the threads using :ref:`OMP111 <omp111>`.

Examples
--------

Normally, any kernel that contains split OpenMP target and parallel regions
will be executed in generic-mode. Sometimes it is easier to use generic-mode
semantics to define shared memory, or to more tightly control the distribution
of the threads. The following example shows a naive matrix-matrix
multiplication that contains code that will need to be guarded.

.. code-block:: c++

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
      map(to: A[0 : M*K]) map(to: B[0 : K*N]) map(tofrom: C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;

  #pragma omp parallel for reduction(+:sum) default(firstprivate)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        C[i*N + j] = sum;
      }
    }
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
  omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
    double sum = 0.0;
               ^
  omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
  #pragma omp target teams distribute collapse(2) \
  ^

This requires guarding the store to the shared variable ``sum`` and the store
to the matrix ``C``. The transformation can be thought of as generating the
code below, in which only a single thread performs each guarded store while
barriers keep the remaining threads synchronized around it.

.. code-block:: c++

  #include <omp.h>

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
      map(to: A[0 : M*K]) map(to: B[0 : K*N]) map(tofrom: C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum;
  #pragma omp parallel default(firstprivate) shared(sum)
        {
  #pragma omp barrier
          if (omp_get_thread_num() == 0)
            sum = 0.0;
  #pragma omp barrier

  #pragma omp for reduction(+:sum)
          for (int k = 0; k < K; k++)
            sum += A[i*K + k] * B[k*N + j];

  #pragma omp barrier
          if (omp_get_thread_num() == 0)
            C[i*N + j] = sum;
  #pragma omp barrier
        }
      }
    }
  }
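For comparison, the same computation can be written so that the kernel can
typically be generated in SPMD-mode from the start by using a combined
construct, which places the parallel region at the top level of the target
region and leaves no single-threaded code to guard. The sketch below is
illustrative rather than output of the transformation; note that it also
parallelizes differently, with each thread computing one element of ``C`` and
performing its reduction serially instead of cooperating on the inner loop.

.. code-block:: c++

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  // Combined construct: every thread executes the full loop body, so the
  // kernel can be generated in SPMD-mode directly and no guarding is needed.
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to: A[0 : M*K]) map(to: B[0 : K*N]) map(tofrom: C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;
        // Each thread owns one (i, j) element and reduces over k serially.
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];
        C[i*N + j] = sum;
      }
    }
  }

Diagnostic Scope
----------------

OpenMP target offloading optimization remark.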