xref: /llvm-project/openmp/docs/remarks/OMP120.rst (revision dead50d4427cbdd5f41c02c5441270822f702730)
.. _omp120:

Transformed generic-mode kernel to SPMD-mode [OMP120]
=====================================================

This optimization remark indicates that the execution strategy for the OpenMP
target offloading kernel was changed. Generic-mode kernels are executed by a
single thread that schedules parallel worker threads using a state machine. This
code transformation can move a kernel that was initially generated in generic
mode to SPMD-mode, where all threads are active at the same time with no state
machine. This execution strategy is closer to how the threads are actually
executed on a GPU target. This is only possible if the instructions previously
executed by a single thread have no side effects or can be guarded. If the
instructions have no side effects, they are simply recomputed by each thread.

Generic-mode is often considerably slower than SPMD-mode because of the extra
overhead required to separately schedule worker threads and pass data between
them. This optimization allows users to use generic-mode semantics while
achieving the performance of SPMD-mode. This can be helpful when defining shared
memory between the threads using :ref:`OMP111 <omp111>`.
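The difference between the two source forms can be sketched as follows. This is a
hypothetical example, not taken from the LLVM sources: a kernel written with the
combined ``target teams distribute parallel for`` construct is generated in
SPMD-mode from the start, whereas splitting the ``target teams distribute`` and
``parallel for`` constructs produces a generic-mode kernel that relies on this
optimization.

.. code-block:: c++

  #include <cassert>

  // Hypothetical sketch: the combined construct below is generated in
  // SPMD-mode directly, so no generic-to-SPMD transformation is needed.
  // Splitting it into a 'target teams distribute' enclosing a separate
  // 'parallel for' would start the kernel in generic-mode instead. Compiled
  // without an offloading target the loop simply runs on the host.
  void scale(int N, double *A, double factor) {
  #pragma omp target teams distribute parallel for map(tofrom : A[0:N])
    for (int i = 0; i < N; i++)
      A[i] *= factor;
  }

  int main() {
    double A[4] = {1.0, 2.0, 3.0, 4.0};
    scale(4, A, 2.0);
    assert(A[0] == 2.0 && A[3] == 8.0);
    return 0;
  }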

Examples
--------

Normally, any kernel that contains split OpenMP target and parallel regions will
be executed in generic-mode. Sometimes it is easier to use generic-mode
semantics to define shared memory or to control the distribution of the threads
more tightly. The following example shows a naive matrix-matrix multiplication
that contains code that will need to be guarded.

.. code-block:: c++

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;

  #pragma omp parallel for reduction(+:sum) default(firstprivate)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        C[i*N + j] = sum;
      }
    }
  }

.. code-block:: console

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
   omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
        double sum = 0.0;
               ^
   omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
   #pragma omp target teams distribute collapse(2) \
   ^

This requires guarding the store to the shared variable ``sum`` and the store to
the matrix ``C``. This can be thought of as generating the code below.

.. code-block:: c++

  #include <omp.h>

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum;
  #pragma omp parallel default(firstprivate) shared(sum)
        {
        #pragma omp barrier
        if (omp_get_thread_num() == 0)
          sum = 0.0;
        #pragma omp barrier

        #pragma omp for reduction(+:sum)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        #pragma omp barrier
        if (omp_get_thread_num() == 0)
          C[i*N + j] = sum;
        #pragma omp barrier
        }
      }
    }
  }
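Because the guarded code preserves the sequential semantics of the original loop
nest, the result can be checked numerically on the host. The sketch below is
hypothetical test code, not part of the LLVM documentation: ``matmul_host`` runs
the same loop nest serially, which is the result the transformed kernel must
reproduce.

.. code-block:: c++

  #include <cassert>

  // Host-only version of the matmul loop nest above, with the OpenMP target
  // pragmas removed; the guarded SPMD-mode kernel must compute the same C.
  void matmul_host(int M, int N, int K, const double *A, const double *B,
                   double *C) {
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < K; k++)
          sum += A[i * K + k] * B[k * N + j];
        C[i * N + j] = sum;
      }
    }
  }

  int main() {
    // A = [[1, 2], [3, 4]] multiplied by the identity must reproduce A.
    const double A[4] = {1.0, 2.0, 3.0, 4.0};
    const double I2x2[4] = {1.0, 0.0, 0.0, 1.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};
    matmul_host(2, 2, 2, A, I2x2, C);
    for (int i = 0; i < 4; i++)
      assert(C[i] == A[i]);
    return 0;
  }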


Diagnostic Scope
----------------

OpenMP target offloading optimization remark.