1<!--===- docs/OpenACC-descriptor-management.md 2 3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 4 See https://llvm.org/LICENSE.txt for license information. 5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception 6 7--> 8 9# OpenACC dialect: Fortran descriptor management in the offload data environment 10 11## Overview 12 13This document describes the details of the Fortran descriptor management during the offload execution using OpenACC. 14 15LLVM Flang compiler uses an extended CFI descriptor data structure to represent some Fortran variables along with their characteristics in memory. For example, the descriptor is used to access dummy assumed shape arrays in a context of a subroutine where the arrays' bounds are not explicit but are rather passed to the subroutine via the descriptor storage. 16 17During the offload execution a variable data (i.e. the memory holding the actual value of the variable) can be transferred between the host and the device using explicit OpenACC constructs. Fortran language does not expose the descriptor representation to the user, but the accesses of variables with descriptors on the device may still be done using the descriptor data. Thus, the implementations must implicitly manage the data transfers of the descriptors along with the transfers of the actual values of the variables. 18 19The MLIR OpenACC dialect is language agnostic so that it can work for Fortran and C/C++ with OpenACC. The dialect should provide means for expressing the logic for data and descriptor management for the offload data environment. 20 21The chapter numbering in this document refers to: 22 23* F202x: Fortran standard J3/23-007r1 24* OpenACC: OpenACC specification version 3.3 25* OpenMP: OpenMP specification version 5.2 26 27## CFI descriptor structure 28 29Flang represents the data descriptors in memory using `CFI_cdesc_t` layout specified by F202x 18.5.3, i.e. the variable data address is located in its first member `base_addr`. The standard does not strictly specify the layout of all the members, moreover, the actual size of the structure may be different for different variables (e.g. due to different ranks). Other compilers may use different data descriptor formats, e.g. the variable data address may be interleaved with the auxiliary members, or the auxiliary data may be operated as a data structure not containing the variable data address. In this document we will only consider supporting the CFI descriptor format as supported by Flang. 30 31## Runtime behavior for variables with descriptors 32 33### Pointer variables 34 35OpenACC specifies the pointer attachment behavior in OpenACC 2.6.8. This paragraph applies both to Fortran and C/C++ pointers, and the Fortran specific representation of the POINTER variables representation is not explicitly specified. There is a single mention of the "Fortran descriptor (dope vector)" in 2.6.4 with regards to the POINTER/ALLOCATABLE members of data structures. Chapter 2.7.2 describes the behavior of different data clause actions including `attach` and `detach` actions for the pointers. Finally, chapters 2.7.4-13 describe what actions are taken for each OpenACC data clause. 36 37The spec operates in terms of the attachment counter associated with each pointer in device memory. This counter is not exposed to the user code explicitly, i.e. there is no standard way to query the attachment counter for a pointer, but there are `acc_attach*` and `acc_detach*` APIs that affect the attachment counter as well as the data clause actions used with OpenACC constructs. 38 39Here is an example to demonstrate the member pointer attachment: 40 41Example: 42 43```Fortran 44module types 45 type ty1 46 real, pointer :: p(:,:) 47 end type ty1 48end module types 49program main 50 use types 51 use openacc 52 type(ty1) :: d 53 real, pointer :: t1(:,:) 54 nullify(d%p) 55 allocate(t1(2,2)) 56 57 ! 2.7.9: 58 ! Allocates the memory for object 'd' on the device. 59 ! The descriptor member of 'd' is a NULL descriptor, 60 ! i.e. the host contents of the descriptor is copied 61 ! verbatim to the device. 62 ! The descriptor storage is created on the device 63 ! just as part of the object 'd' storage. 64 !$acc enter data create(d) 65 66 d%p => t1 67 68 ! 2.7.7: 69 ! Pointer d%p is not present on the device, so copyin 70 ! action is performed for the data pointed to by the pointer: 71 ! the memory for 'REAL :: (2,2)' array is allocated on the device 72 ! and the host values of the array elements are copied to 73 ! the allocated device memory. 74 ! Then the attach action is performed, i.e. the contents 75 ! of the device descriptor of d%p are updated as: 76 ! * The base_addr member of the descriptor on the device 77 ! is initialized to the device address of the data 78 ! that has been initialized during the copyin. 79 ! * The auxiliary members of the device descriptor are initialized 80 ! from the host values of the corresponding members. 81 ! * The attachment counter of 'd%p' is set to 1. 82 !$acc enter data copyin(d%p) 83 84 ! 2.7.7: 85 ! Pointer d%p is already present on the device, so no copyin 86 ! action is performed. 87 ! The attach action is performed according to 2.6.8: 88 ! since the pointer is associated with the same target as 89 ! during the previous attachment, only its attachment counter 90 ! is incremented to 2. 91 !$acc enter data copyin(d%p) 92 93 ! 3.2.29: 94 ! The detach action is performed. According to 2.7.2 the attachment 95 ! counter of d%p is decremented to 1. 96 call acc_detach(d%p) 97 98 ! 3.2.29: 99 ! The detach action is performed. According to 2.7.2 the attachment 100 ! counter of d%p is decremented to 0, which initiates an update 101 ! of the the device pointer to have the same value as the corresponding 102 ! pointer in the local memory. 103 ! We will discuss this in more detail below. 104 call acc_detach(d%p) 105 106 ! The following construct will fail, because the 'd%p' descriptor's 107 ! base_addr is now the host address not accessible on the device. 108 ! Without the second 'acc_detach' it will work correctly. 109 !$acc serial present(d) 110 print *, d%p(1,2) 111 !$acc end serial 112``` 113 114Let's discuss in more detail what happens during the second `acc_detach`. 115 116OpenACC 2.6.4: 117 118> 1360 An attach action updates the pointer in device memory to point to the device copy of the data 119> 1361 that the host pointer targets; see Section 2.7.2. For Fortran array pointers and allocatable arrays, 120> 1362 this includes copying any associated descriptor (dope vector) to the device copy of the pointer. 121> 1363 When the device pointer target is deallocated, the pointer in device memory should be restored 122> 1364 to the host value, so it can be safely copied back to host memory. A detach action updates the 123> 1365 pointer in device memory to have the same value as the corresponding pointer in local memory; 124> 1366 see Section 2.7.2. 125 126It explicitly says that the associated descriptor copy happens during the attach action, but it does not specify the same for the detach action. So one interpretation of this could be that only the `base_addr` member is updated, but this would allow chimera descriptors in codes like this: 127 128Example: 129 130```Fortran 131 !$acc enter data copyin(d) 132 d%p => t1 133 !$acc enter data copyin(d%p) 134 d%p(10:,10:) => d%p 135 call acc_detach(d%p) 136 !$acc exit data copyout(d) 137 print *, lbound(d%p) 138``` 139 140At the point of `acc_detach` the host descriptor of `d%p` points to `t1` data and has the lower bounds `(10:, 10:)`, so if the detach action only updates the `base_addr` member of the device descriptor and does not update the auxiliary members from their current host values, then during the `copyout(d)` the host descriptor `d%p` will have the stale lower bounds `(2:, 2:)`. 141 142So the proposed reading of the spec here is that the same note about the descriptor applies equally to the attach and the detach actions. 143 144#### "Moving target" 145 146According to OpenACC 2.6.8: 147 148> 1535 when the pointer is allocated in device memory. **The attachment counter for a pointer is set to one** 149> **1536 whenever the pointer is attached to new target address**, and incremented whenever an attach action 150> 1537 for that pointer is performed for the same target address. 151 152This clearly applies to the following example, where the second attach action is executed while the host pointer is attached to new target address comparing to the first attach action: 153 154Example: 155 156```Fortran 157 !$acc enter data copyin(d) 158 !$acc enter data copyin(t1, t2) 159 d%p => t1 160 !$acc enter data attach(d%p) 161 d%p => t2 162 !$acc enter data attach(d%p) 163``` 164 165The spec is not too explicit about the following example, though: 166 167Example: 168 169```Fortran 170 !$acc enter data copyin(d) 171 !$acc enter data copyin(t1, t2) 172 d%p => t1 173 !$acc enter data attach(d%p) 174 d%p(10:,10:) => d%p 175 !$acc enter data attach(d%p) 176``` 177 178The `d%p` pointer has not been attached to **new target address** between the two attach actions, so the device descriptor update is not strictly required, but the proposed spec reading is to execute the update as if the pointer has changed its association between the two attach actions (i.e. the attachment counter of `d%p` is 1 after the second attach, not 2; and the device descriptor is updated with the current values). 179 180In other words, the Fortran pointer in the context of the attach actions should be considered not just as the target address but as a combination of all the values embedded inside its Fortran descriptor. 181 182#### Pointer dummy arguments 183 184All the same rules apply to the pointer dummy arguments even though OpenACC spec, again, mentions the descriptor copying only in the context of member pointers. The following test demonstrates that a consistent implementation should apply 2.6.4 for pointer dummy arguments, otherwise, OpenACC programs may exhibit unexpected behavior after seemingly straightforward subroutine inlining/outlining: 185 186Example: 187 188```Fortran 189 type(ty1) :: d 190 real, pointer :: t1(:,:) 191 allocate(t1(2,2)) 192 call wrapper(d, d%p, t1) 193contains 194 subroutine wrapper(d, p, t1) 195 type(ty1), target :: d 196 real, pointer :: p(:,:) 197 real, pointer :: t1(:,:) 198 !$acc enter data copyin(d) 199 !$acc enter data copyin(t1) 200 p => t1 201 !$acc enter data copyin(p) 202 !$acc serial present(d) 203 print *, d%p(1,2) 204 !$acc end serial 205``` 206 207If the descriptor contents is not copied during the attach action implied by `copyin(p)`, then this code does not behave the same way as: 208 209```Fortran 210 type(ty1) :: d 211 real, pointer :: t1(:,:) 212 allocate(t1(2,2)) 213 !$acc enter data copyin(d) 214 !$acc enter data copyin(t1) 215 d%p => t1 216 !$acc enter data copyin(d%p) 217 !$acc serial present(d) 218 print *, d%p(1,2) 219 !$acc end serial 220 !call wrapper(d, d%p, t1) 221``` 222 223#### The descriptor storage allocation 224 225For POINTER members of aggregate variables the descriptor storage is allocated on the device as part of the allocation of the aggregate variable, which is done either explicitly in the user code or implicitly due to 2.6.2: 226 227> 1292 On a compute or combined construct, if a variable appears in a reduction clause but no other 228> 1293 data clause, it is treated as if it also appears in a copy clause. Otherwise, for any variable, the 229> 1294 compiler will implicitly determine its data attribute on a compute construct if all of the following 230> 1295 conditions are met: 231> ... 232> 1299 An aggregate variable will be treated as if it appears either: 233> 1300 • In a present clause if there is a default(present) clause visible at the compute con 234> 1301 struct. 235> 1302 • In a copy clause otherwise. 236 237Example: 238 239```Fortran 240 type(ty1) :: d 241 real, target :: t1(2,2) 242 d%p => t1 243 !$acc enter data copyin(d%p) 244 !$acc serial present(d%p) 245 print *, d%p(1,2) 246 !$acc end serial 247``` 248 249Due to `d%p` reference in the `present` clause of the `serial` region, the compiler must produce an implicit copy of `d`. In order for the `d%p` pointer attachment to happen the descriptor storage must be created before the attachment happens, so the following order of the clauses must be implied: 250 251```Fortran 252 !$acc serial copy(d) present(d%p) 253``` 254 255In the case of POINTER dummy argument, if the descriptor storage is not explicitly created in the user code, the pointer attachment may not happen due to 2.7.2: 256 257> 1693 If the pointer var is in shared memory or is not present in the current device memory, or if the 258> 1694 address to which var points is not present in the current device memory, no action is taken. 259 260Example: 261 262```Fortran 263 d%p => t1 264 !$acc enter data copyin(t1) 265 call wrapper(d%p) 266contains 267 subroutine wrapper(p) 268 real, pointer :: p(:,:) 269 !$acc serial attach(p) 270 print *, p(1,2) 271 !$acc end serial 272``` 273 274### Allocatable variables 275 276OpenACC 2.6.4 names both POINTER and ALLOCATABLE members of data structures as *pointer*, so the same attachment rules apply to both, including the case of dummy ALLOCATABLE arguments: 277 278Example: 279 280```Fortran 281module types 282 type ty2 283 real, allocatable :: a(:,:) 284 end type ty2 285end module types 286 use types 287 type(ty2), target :: dd 288 dd%a = reshape((/1,2,3,4/),(/2,2/)) 289 call wrapper(dd, dd%a) 290contains 291 subroutine wrapper(dd, a) 292 type(ty2), target :: dd 293 real, allocatable, target :: a(:,:) 294 !$acc enter data copyin(dd) 295 !$acc enter data copyin(a) 296 !$acc serial present(dd) 297 print *, dd%a(1,2) 298 !$acc end serial 299``` 300 301### Other variables 302 303F18 compiler also uses descriptors for assumed-shape, assumed-rank, polymorphic, ... variables. The OpenACC specification does not prescribe how an implementation should manage the descriptors for such variables. In many (all?) cases the descriptors of these variables have a local scope of a single subprogram, and if a descriptor of such a variable is created on the device, then its live range must be limited on the device by the invocation of the subprogram (with any OpenACC constructs inside it). 304 305For example: 306 307```Fortran 308 type(ty2), target :: dd 309 ... 310 call wrapper(dd%a) 311contains 312 subroutine wrapper(a) 313 real :: a(10:,10:) 314 !$acc serial copyin(a) 315 print *, a(10,11) 316 !$acc end serial 317``` 318 319The dummy assumed-shape argument `a` is represented with a descriptor, which has no storage overlap with `dd%a`, i.e. it is a temporary descriptor created to represent the data `dd%a` in a shape according to the declaration of the dummy argument `a`. The implementation is not strictly required to transfer all the values embedded inside the descriptor for `a` to the device. The only required actions for this code are the ones prescribed by the `copyin(a)` clause in 2.7.7. 320 321### Summary 322 323Pointer attachment for POINTER and ALLOCATABLE variables is a "composite" runtime action that involves the following: 324 325* Getting the device address corresponding to the device copy of the descriptor. 326* Comparing the current host descriptor contents with the device descriptor contents (for proper attachment counter updates). 327* Getting the device address corresponding to the device copy of the data pointed to by the descriptor. 328* Copying data from the host to the device to update the device copy of the descriptor: this data may include the device address of the data, the descriptor data describing the element size, dimensions, etc. 329* Descriptors with an F18 addendum may also require mapping the data pointed to by the addendum pointer(s) and attaching this pointer(s) into the device copy of the descriptor. 330 331## Representing pointer attachment in MLIR OpenACC dialect 332 333The Fortran pointer attachment logic specified by OpenACC is not trivial, and in order to be expressed in a language independent MLIR OpenACC dialect we propose to use recipes for delegating the complexity of the implementation to F18 runtime. 334 335```Fortran 336 !$acc enter data attach(d%p) 337``` 338 339The frontend generates an `acc.attach` data operation with `augPtr` being an address of the F18 descriptor representing a POINTER/ALLOCATABLE variable. Note that `augPtr` refers to an abstract augmented pointer structure, which is handled in a language specific manner by the code provided by the `attachRecipe` reference. 340 341The `attachRecipe` is a callback that takes `varPtr` and `augPtr` pointers, and the section's `offset` and `size` computed from the `bounds` operand of `acc.attach`. Fortran FE passes these arguments directly to F18 runtime that is aware of the descriptor structure and does all the required checks and device memory updates for the device copy of the descriptor, including the attachment counters updates. 342 343``` 344acc.attach.recipe @attach_ref : 345 (!fir.ref<none>, !fir.ref<!fir.box<none>>, index, index) { 346^bb0(%base_addr_val : !fir.ref<none>, 347 %aug_ptr : !fir.ref<!fir.box<none>>, 348 %offset : index, 349 %size : index): 350 fir.call _FortranAOpenACCAttachDescriptor(%aug_ptr, %base_addr_val, %offset, %size) : 351 (!fir.ref<none>, !fir.ref<!fir.box<none>>, index, index) -> () 352 acc.yield 353} 354 355%descref = hlfir.designate %0#0{"p"} 356 {fortran_attrs = #fir.var_attrs<pointer>} : 357 (!fir.ref<!fir.type<_QMtypesTty1{p:!fir.box<!fir.ptr<!fir.array<?x?xf32>>>}>>) -> 358 !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>> 359%descval = fir.load %descref : !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>> 360%base_addr_val = fir.box_addr %descval : (!fir.box<!fir.ptr<!fir.array<?x?xf32>>>) -> 361 !fir.ptr<!fir.array<?x?xf32>> 362%attach_op = acc.attach 363 varPtr(%base_addr_val : !fir.ptr<!fir.array<?x?xf32>>) 364 augPtr(%descref : !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>>) 365 bounds(...) 366 attachRecipe(@attach_ref) -> 367 !fir.ptr<!fir.array<?x?xf32>> {name = "d%p", structured = false} 368acc.enter_data dataOperands(%attach_op : !fir.ptr<!fir.array<?x?xf32>>) 369``` 370 371> Note that for languages not using augmented pointers, we can still use `varPtrPtr` operand to represent the "simple" pointer attachment. The recipe should be omitted in this case. 372 373For other data clauses there is an implied ordering that the data action happens before the attachment: 374 375```Fortran 376 !$acc enter data copyin(d%p) 377``` 378 379``` 380%copyin_op = acc.copyin 381 varPtr(%base_addr_val : !fir.ptr<!fir.array<?x?xf32>>) 382 augPtr(%descref : !fir.ref<!fir.box<!fir.ptr<!fir.array<?x?xf32>>>>) 383 bounds(...) 384 attachRecipe(@attach_ref) -> 385 !fir.ptr<!fir.array<?x?xf32>> {name = "d%p", structured = false} 386``` 387 388Here, the `copyin` of the data is followed by the pointer attachment. 389 390### F18 runtime support 391 392The `OpenACCAttachDescriptor` API is defined like this: 393 394```C++ 395/// Implement OpenACC attach semantics for the given Fortran descriptor. 396/// The \p data_ptr must match the descriptor's base_addr member value, 397/// it is only used for verification. 398/// The given \p offset and \p size specify an array section starting 399/// offset and the size of the contiguous section for the array cases, 400/// e.g. 'attach(array(2:3))'. For scalar cases, the offset must be 0, 401/// and the size must match the scalar size. 402/// 403/// TODO: The API needs to take the device id. 404void RTNAME(OpenACCAttachDescriptor)(const Descriptor *descriptor_ptr, 405 const void *data_ptr, 406 std::size_t offset, 407 std::size_t size, 408 const char *sourceFile, 409 int sourceLine); 410``` 411 412The implementation's behavior may be described as (OpenACC 2.7.2): 413 414* If the data described by the host address `data_ptr`, `offset` and `size` is not present on the device, RETURN. 415* If the data described by `descriptor_ptr` and the descriptor size is not present on the device, RETURN. 416* If the descriptor's attachment counter is not 0 and the host descriptor contents matches the host descriptor contents used for the previous attachment, then increment the attachment counter and RETURN. 417* Update descriptor on the device: 418 * Copy the host descriptor contents to device memory. 419 420 * Copy the device address corresponding to `data_ptr` into the `base_addr` member of the descriptor in device memory. 421 422 * Perform an appropriate data action for all auxiliary pointers, e.g. `present(addendum_ptr)`/`copyin(addendum_ptr[:size])`, and copy the corresponding device addresses into their locations in the descriptor in device memory. 423 424 * Set the descriptor's attachment counter to 1. 425 426* RETURN 427 428All the "is-present" checks and the data actions for the auxiliary pointers must be performed atomically with regards to the present counters bookkeeping. 429 430The API relies on the primitives provided by `liboffload`, so it is provided by a new F18 runtime library, e.g. `FortranOffloadRuntime`, that depends on `FortranRuntime` and `liboffload`. The F18 driver adds `FortranOffloadRuntime` for linking under `-fopenacc`/`-fopenmp` (and maybe additional switches like `-fopenmp-targets`). 431 432## TODOs: 433 434* Cover the detach action. 435