xref: /llvm-project/flang/docs/OpenACC-descriptor-management.md (revision 12ba74e181bd6641b532e271f3bfabf53066b1c0)
1<!--===- docs/OpenACC-descriptor-management.md
2
3   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4   See https://llvm.org/LICENSE.txt for license information.
5   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6
7-->
8
9# OpenACC dialect: Fortran descriptor management in the offload data environment
10
11## Overview
12
13This document describes the details of the Fortran descriptor management during the offload execution using OpenACC.
14
15LLVM Flang compiler uses an extended CFI descriptor data structure to represent some Fortran variables along with their characteristics in memory.  For example, the descriptor is used to access dummy assumed shape arrays in a context of a subroutine where the arrays' bounds are not explicit but are rather passed to the subroutine via the descriptor storage.
16
17During the offload execution a variable data (i.e. the memory holding the actual value of the variable) can be transferred between the host and the device using explicit OpenACC constructs.  Fortran language does not expose the descriptor representation to the user, but the accesses of variables with descriptors on the device may still be done using the descriptor data.  Thus, the implementations must implicitly manage the data transfers of the descriptors along with the transfers of the actual values of the variables.
18
19The MLIR OpenACC dialect is language agnostic so that it can work for Fortran and C/C++ with OpenACC.  The dialect should provide means for expressing the logic for data and descriptor management for the offload data environment.
20
21The chapter numbering in this document refers to:
22
23* F202x: Fortran standard J3/23-007r1
24* OpenACC: OpenACC specification version 3.3
25* OpenMP: OpenMP specification version 5.2
26
27## CFI descriptor structure
28
29Flang represents the data descriptors in memory using `CFI_cdesc_t` layout specified by F202x 18.5.3, i.e. the variable data address is located in its first member `base_addr`.  The standard does not strictly specify the layout of all the members, moreover, the actual size of the structure may be different for different variables (e.g. due to different ranks).  Other compilers may use different data descriptor formats, e.g. the variable data address may be interleaved with the auxiliary members, or the auxiliary data may be operated as a data structure not containing the variable data address.  In this document we will only consider supporting the CFI descriptor format as supported by Flang.
30
31## Runtime behavior for variables with descriptors
32
33### Pointer variables
34
35OpenACC specifies the pointer attachment behavior in OpenACC 2.6.8.  This paragraph applies both to Fortran and C/C++ pointers, and the Fortran specific representation of the POINTER variables representation is not explicitly specified.  There is a single mention of the "Fortran descriptor (dope vector)" in 2.6.4 with regards to the POINTER/ALLOCATABLE members of data structures.  Chapter 2.7.2 describes the behavior of different data clause actions including `attach` and `detach` actions for the pointers.  Finally, chapters 2.7.4-13 describe what actions are taken for each OpenACC data clause.
36
37The spec operates in terms of the attachment counter associated with each pointer in device memory.  This counter is not exposed to the user code explicitly, i.e. there is no standard way to query the attachment counter for a pointer, but there are `acc_attach*` and `acc_detach*` APIs that affect the attachment counter as well as the data clause actions used with OpenACC constructs.
38
39Here is an example to demonstrate the member pointer attachment:
40
41Example:
42
43```Fortran
44module types
45  type ty1
46     real, pointer :: p(:,:)
47  end type ty1
48end module types
49program main
50  use types
51  use openacc
52  type(ty1) :: d
53  real, pointer :: t1(:,:)
54  nullify(d%p)
55  allocate(t1(2,2))
56
57  ! 2.7.9:
58  ! Allocates the memory for object 'd' on the device.
59  ! The descriptor member of 'd' is a NULL descriptor,
60  ! i.e. the host contents of the descriptor is copied
61  ! verbatim to the device.
62  ! The descriptor storage is created on the device
63  ! just as part of the object 'd' storage.
64  !$acc enter data create(d)
65
66  d%p => t1
67
68  ! 2.7.7:
69  ! Pointer d%p is not present on the device, so copyin
70  ! action is performed for the data pointed to by the pointer:
71  ! the memory for 'REAL :: (2,2)' array is allocated on the device
72  ! and the host values of the array elements are copied to
73  ! the allocated device memory.
74  ! Then the attach action is performed, i.e. the contents
75  ! of the device descriptor of d%p are updated as:
76  !   * The base_addr member of the descriptor on the device
77  !     is initialized to the device address of the data
78  !     that has been initialized during the copyin.
79  !   * The auxiliary members of the device descriptor are initialized
80  !     from the host values of the corresponding members.
81  !   * The attachment counter of 'd%p' is set to 1.
82  !$acc enter data copyin(d%p)
83
84  ! 2.7.7:
85  ! Pointer d%p is already present on the device, so no copyin
86  ! action is performed.
87  ! The attach action is performed according to 2.6.8:
88  ! since the pointer is associated with the same target as
89  ! during the previous attachment, only its attachment counter
90  ! is incremented to 2.
91  !$acc enter data copyin(d%p)
92
93  ! 3.2.29:
94  ! The detach action is performed. According to 2.7.2 the attachment
95  ! counter of d%p is decremented to 1.
96  call acc_detach(d%p)
97
98  ! 3.2.29:
99  ! The detach action is performed. According to 2.7.2 the attachment
100  ! counter of d%p is decremented to 0, which initiates an update
101  ! of the the device pointer to have the same value as the corresponding
102  ! pointer in the local memory.
103  ! We will discuss this in more detail below.
104  call acc_detach(d%p)
105
106  ! The following construct will fail, because the 'd%p' descriptor's
107  ! base_addr is now the host address not accessible on the device.
108  ! Without the second 'acc_detach' it will work correctly.
109  !$acc serial present(d)
110  print *, d%p(1,2)
111  !$acc end serial
112```
113
114Let's discuss in more detail what happens during the second `acc_detach`.
115
116OpenACC 2.6.4:
117
118> 1360 An attach action updates the pointer in device memory to point to the device copy of the data
119> 1361 that the host pointer targets; see Section 2.7.2. For Fortran array pointers and allocatable arrays,
120> 1362 this includes copying any associated descriptor (dope vector) to the device copy of the pointer.
121> 1363 When the device pointer target is deallocated, the pointer in device memory should be restored
122> 1364 to the host value, so it can be safely copied back to host memory. A detach action updates the
123> 1365 pointer in device memory to have the same value as the corresponding pointer in local memory;
124> 1366 see Section 2.7.2.
125
126It explicitly says that the associated descriptor copy happens during the attach action, but it does not specify the same for the detach action.  So one interpretation of this could be that only the `base_addr` member is updated, but this would allow chimera descriptors in codes like this:
127
128Example:
129
130```Fortran
131  !$acc enter data copyin(d)
132  d%p => t1
133  !$acc enter data copyin(d%p)
134  d%p(10:,10:) => d%p
135  call acc_detach(d%p)
136  !$acc exit data copyout(d)
137  print *, lbound(d%p)
138```
139
140At the point of `acc_detach` the host descriptor of `d%p` points to `t1` data and has the lower bounds `(10:, 10:)`, so if the detach action only updates the `base_addr` member of the device descriptor and does not update the auxiliary members from their current host values, then during the `copyout(d)` the host descriptor `d%p` will have the stale lower bounds `(2:, 2:)`.
141
142So the proposed reading of the spec here is that the same note about the descriptor applies equally to the attach and the detach actions.
143
144#### "Moving target"
145
146According to OpenACC 2.6.8:
147
148> 1535 when the pointer is allocated in device memory. **The attachment counter for a pointer is set to one**
149> **1536 whenever the pointer is attached to new target address**, and incremented whenever an attach action
150> 1537 for that pointer is performed for the same target address.
151
152This clearly applies to the following example, where the second attach action is executed while the host pointer is attached to new target address comparing to the first attach action:
153
154Example:
155
156```Fortran
157  !$acc enter data copyin(d)
158  !$acc enter data copyin(t1, t2)
159  d%p => t1
160  !$acc enter data attach(d%p)
161  d%p => t2
162  !$acc enter data attach(d%p)
163```
164
165The spec is not too explicit about the following example, though:
166
167Example:
168
169```Fortran
170  !$acc enter data copyin(d)
171  !$acc enter data copyin(t1, t2)
172  d%p => t1
173  !$acc enter data attach(d%p)
174  d%p(10:,10:) => d%p
175  !$acc enter data attach(d%p)
176```
177
178The `d%p` pointer has not been attached to **new target address** between the two attach actions, so the device descriptor update is not strictly required, but the proposed spec reading is to execute the update as if the pointer has changed its association between the two attach actions (i.e. the attachment counter of `d%p` is 1 after the second attach, not 2; and the device descriptor is updated with the current values).
179
180In other words, the Fortran pointer in the context of the attach actions should be considered not just as the target address but as a combination of all the values embedded inside its Fortran descriptor.
181
182#### Pointer dummy arguments
183
184All the same rules apply to the pointer dummy arguments even though OpenACC spec, again, mentions the descriptor copying only in the context of member pointers.  The following test demonstrates that a consistent implementation should apply 2.6.4 for pointer dummy arguments, otherwise, OpenACC programs may exhibit unexpected behavior after seemingly straightforward subroutine inlining/outlining:
185
186Example:
187
188```Fortran
189  type(ty1) :: d
190  real, pointer :: t1(:,:)
191  allocate(t1(2,2))
192  call wrapper(d, d%p, t1)
193contains
194  subroutine wrapper(d, p, t1)
195    type(ty1), target :: d
196    real, pointer :: p(:,:)
197    real, pointer :: t1(:,:)
198    !$acc enter data copyin(d)
199    !$acc enter data copyin(t1)
200    p => t1
201    !$acc enter data copyin(p)
202    !$acc serial present(d)
203    print *, d%p(1,2)
204    !$acc end serial
205```
206
207If the descriptor contents is not copied during the attach action implied by `copyin(p)`, then this code does not behave the same way as:
208
209```Fortran
210  type(ty1) :: d
211  real, pointer :: t1(:,:)
212  allocate(t1(2,2))
213  !$acc enter data copyin(d)
214  !$acc enter data copyin(t1)
215  d%p => t1
216  !$acc enter data copyin(d%p)
217  !$acc serial present(d)
218  print *, d%p(1,2)
219  !$acc end serial
220  !call wrapper(d, d%p, t1)
221```
222
223#### The descriptor storage allocation
224
225For POINTER members of aggregate variables the descriptor storage is allocated on the device as part of the allocation of the aggregate variable, which is done either explicitly in the user code or implicitly due to 2.6.2:
226
227> 1292 On a compute or combined construct, if a variable appears in a reduction clause but no other
228> 1293 data clause, it is treated as if it also appears in a copy clause. Otherwise, for any variable, the
229> 1294 compiler will implicitly determine its data attribute on a compute construct if all of the following
230> 1295 conditions are met:
231> ...
232> 1299 An aggregate variable will be treated as if it appears either:
233> 1300 • In a present clause if there is a default(present) clause visible at the compute con
234> 1301 struct.
235> 1302 • In a copy clause otherwise.
236
237Example:
238
239```Fortran
240  type(ty1) :: d
241  real, target :: t1(2,2)
242  d%p => t1
243  !$acc enter data copyin(d%p)
244  !$acc serial present(d%p)
245  print *, d%p(1,2)
246  !$acc end serial
247```
248
249Due to `d%p` reference in the `present` clause of the `serial` region, the compiler must produce an implicit copy of `d`. In order for the `d%p` pointer attachment to happen the descriptor storage must be created before the attachment happens, so the following order of the clauses must be implied:
250
251```Fortran
252  !$acc serial copy(d) present(d%p)
253```
254
255In the case of POINTER dummy argument, if the descriptor storage is not explicitly created in the user code, the pointer attachment may not happen due to 2.7.2:
256
257> 1693 If the pointer var is in shared memory or is not present in the current device memory, or if the
258> 1694 address to which var points is not present in the current device memory, no action is taken.
259
260Example:
261
262```Fortran
263  d%p => t1
264  !$acc enter data copyin(t1)
265  call wrapper(d%p)
266contains
267  subroutine wrapper(p)
268    real, pointer :: p(:,:)
269    !$acc serial attach(p)
270    print *, p(1,2)
271    !$acc end serial
272```
273
274### Allocatable variables
275
276OpenACC 2.6.4 names both POINTER and ALLOCATABLE members of data structures as *pointer*, so the same attachment rules apply to both, including the case of dummy ALLOCATABLE arguments:
277
278Example:
279
280```Fortran
281module types
282  type ty2
283     real, allocatable :: a(:,:)
284  end type ty2
285end module types
286  use types
287  type(ty2), target :: dd
288  dd%a = reshape((/1,2,3,4/),(/2,2/))
289  call wrapper(dd, dd%a)
290contains
291  subroutine wrapper(dd, a)
292    type(ty2), target :: dd
293    real, allocatable, target :: a(:,:)
294    !$acc enter data copyin(dd)
295    !$acc enter data copyin(a)
296    !$acc serial present(dd)
297    print *, dd%a(1,2)
298    !$acc end serial
299```
300
301### Other variables
302
303F18 compiler also uses descriptors for assumed-shape, assumed-rank, polymorphic, ... variables.  The OpenACC specification does not prescribe how an implementation should manage the descriptors for such variables.  In many (all?) cases the descriptors of these variables have a local scope of a single subprogram, and if a descriptor of such a variable is created on the device, then its live range must be limited on the device by the invocation of the subprogram (with any OpenACC constructs inside it).
304
305For example:
306
307```Fortran
308  type(ty2), target :: dd
309  ...
310  call wrapper(dd%a)
311contains
312  subroutine wrapper(a)
313    real :: a(10:,10:)
314    !$acc serial copyin(a)
315    print *, a(10,11)
316    !$acc end serial
317```
318
319The dummy assumed-shape argument `a` is represented with a descriptor, which has no storage overlap with `dd%a`, i.e. it is a temporary descriptor created to represent the data `dd%a` in a shape according to the declaration of the dummy argument `a`.  The implementation is not strictly required to transfer all the values embedded inside the descriptor for `a` to the device.  The only required actions for this code are the ones prescribed by the `copyin(a)` clause in 2.7.7.
320
321### Summary
322
323Pointer attachment for POINTER and ALLOCATABLE variables is a "composite" runtime action that involves the following:
324
325* Getting the device address corresponding to the device copy of the descriptor.
326* Comparing the current host descriptor contents with the device descriptor contents (for proper attachment counter updates).
327* Getting the device address corresponding to the device copy of the data pointed to by the descriptor.
328* Copying data from the host to the device to update the device copy of the descriptor: this data may include the device address of the data, the descriptor data describing the element size, dimensions, etc.
329* Descriptors with an F18 addendum may also require mapping the data pointed to by the addendum pointer(s) and attaching this pointer(s) into the device copy of the descriptor.
330
331## Representing pointer attachment in MLIR OpenACC dialect
332
333The Fortran pointer attachment logic specified by OpenACC is not trivial, and in order to be expressed in a language independent MLIR OpenACC dialect we propose to use recipes for delegating the complexity of the implementation to F18 runtime.
334
335```Fortran
336  !$acc enter data attach(d%p)
337```
338
339The frontend generates an `acc.attach` data operation with `augPtr` being an address of the F18 descriptor representing a POINTER/ALLOCATABLE variable.  Note that `augPtr` refers to an abstract augmented pointer structure, which is handled in a language specific manner by the code provided by the `attachRecipe` reference.
340
341The `attachRecipe` is a callback that takes `varPtr` and `augPtr` pointers, and the section's `offset` and `size` computed from the `bounds` operand of `acc.attach`.  Fortran FE passes these arguments directly to F18 runtime that is aware of the descriptor structure and does all the required checks and device memory updates for the device copy of the descriptor, including the attachment counters updates.
342
343```
344acc.attach.recipe @attach_ref :
345    (!fir.ref<none>, !fir.ref<!fir.box<none>>, index, index) {
346^bb0(%base_addr_val : !fir.ref<none>,
347     %aug_ptr : !fir.ref<!fir.box<none>>,
348     %offset : index,
349     %size : index):
350  fir.call _FortranAOpenACCAttachDescriptor(%aug_ptr, %base_addr_val, %offset, %size) :
351      (!fir.ref<none>, !fir.ref<!fir.box<none>>, index, index) -> ()
352  acc.yield
353}
354
355%descref = hlfir.designate %0#0{"p"}
356    {fortran_attrs = #fir.var_attrs<pointer>} :
357    (!fir.ref<!fir.type<_QMtypesTty1{p:!fir.box<!fir.ptr<!fir.array<?x?xf32>>>}>>) ->
358    !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>>
359%descval = fir.load %descref : !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>>
360%base_addr_val = fir.box_addr %descval : (!fir.box<!fir.ptr<!fir.array<?x?xf32>>>) ->
361    !fir.ptr<!fir.array<?x?xf32>>
362%attach_op = acc.attach
363    varPtr(%base_addr_val : !fir.ptr<!fir.array<?x?xf32>>)
364    augPtr(%descref : !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>>)
365    bounds(...)
366    attachRecipe(@attach_ref) ->
367    !fir.ptr<!fir.array<?x?xf32>> {name = "d%p", structured = false}
368acc.enter_data dataOperands(%attach_op : !fir.ptr<!fir.array<?x?xf32>>)
369```
370
371> Note that for languages not using augmented pointers, we can still use `varPtrPtr` operand to represent the "simple" pointer attachment.  The recipe should be omitted in this case.
372
373For other data clauses there is an implied ordering that the data action happens before the attachment:
374
375```Fortran
376  !$acc enter data copyin(d%p)
377```
378
379```
380%copyin_op = acc.copyin
381    varPtr(%base_addr_val : !fir.ptr<!fir.array<?x?xf32>>)
382    augPtr(%descref : !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>>)
383    bounds(...)
384    attachRecipe(@attach_ref) ->
385    !fir.ptr<!fir.array<?x?xf32>> {name = "d%p", structured = false}
386```
387
388Here, the `copyin` of the data is followed by the pointer attachment.
389
390### F18 runtime support
391
392The `OpenACCAttachDescriptor` API is defined like this:
393
394```C++
395/// Implement OpenACC attach semantics for the given Fortran descriptor.
396/// The \p data_ptr must match the descriptor's base_addr member value,
397/// it is only used for verification.
398/// The given \p offset and \p size specify an array section starting
399/// offset and the size of the contiguous section for the array cases,
400/// e.g. 'attach(array(2:3))'. For scalar cases, the offset must be 0,
401/// and the size must match the scalar size.
402///
403/// TODO: The API needs to take the device id.
404void RTNAME(OpenACCAttachDescriptor)(const Descriptor *descriptor_ptr,
405                                     const void *data_ptr,
406                                     std::size_t offset,
407                                     std::size_t size,
408                                     const char *sourceFile,
409                                     int sourceLine);
410```
411
412The implementation's behavior may be described as (OpenACC 2.7.2):
413
414* If the data described by the host address `data_ptr`, `offset` and `size` is not present on the device, RETURN.
415* If the data described by `descriptor_ptr` and the descriptor size is not present on the device, RETURN.
416* If the descriptor's attachment counter is not 0 and the host descriptor contents matches the host descriptor contents used for the previous attachment, then increment the attachment counter and RETURN.
417* Update descriptor on the device:
418  * Copy the host descriptor contents to device memory.
419
420  * Copy the device address corresponding to `data_ptr` into the `base_addr` member of the descriptor in device memory.
421
422  * Perform an appropriate data action for all auxiliary pointers, e.g. `present(addendum_ptr)`/`copyin(addendum_ptr[:size])`, and copy the corresponding device addresses into their locations in the descriptor in device memory.
423
424  * Set the descriptor's attachment counter to 1.
425
426* RETURN
427
428All the "is-present" checks and the data actions for the auxiliary pointers must be performed atomically with regards to the present counters bookkeeping.
429
430The API relies on the primitives provided by `liboffload`, so it is provided by a new F18 runtime library, e.g. `FortranOffloadRuntime`, that depends on `FortranRuntime` and `liboffload`.  The F18 driver adds `FortranOffloadRuntime` for linking under `-fopenacc`/`-fopenmp` (and maybe additional switches like `-fopenmp-targets`).
431
432## TODOs:
433
434* Cover the detach action.
435