1<!--===- docs/Parsing.md 2 3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 4 See https://llvm.org/LICENSE.txt for license information. 5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception 6 7--> 8 9# The F18 Parser 10 11```{contents} 12--- 13local: 14--- 15``` 16 17This program source code implements a parser for the Fortran programming 18language. 19 20The draft ISO standard for Fortran 2018 dated July 2017 was used as the 21primary definition of the language. The parser also accepts many features 22from previous versions of the standard that are no longer part of the Fortran 232018 language. 24 25It also accepts many features that have never been part of any version 26of the standard Fortran language but have been supported by previous 27implementations and are known or suspected to remain in use. As a 28general principle, we want to recognize and implement any such feature 29so long as it does not conflict with requirements of the current standard 30for Fortran. 31 32The parser is implemented in standard ISO C++ and requires the 2017 33edition of the language and library. The parser constitutes a reentrant 34library with no mutable or constructed static data. Best modern C++ 35programming practices are observed to ensure that the ownership of 36dynamic memory is clear, that value rather than object semantics are 37defined for the data structures, that most functions are free from 38invisible side effects, and that the strictest available type checking 39is enforced by the C++ compiler when the Fortran parser is built. 40Class inheritance is rare and dynamic polymorphism is avoided in favor 41of modern discriminated unions. To the furthest reasonable extent, the 42parser has been implemented in a declarative fashion that corresponds 43closely to the text of the Fortran language standard. 44 45The several major modules of the Fortran parser are composed into a 46top-level Parsing class, by means of which one may drive the parsing of a 47source file and receive its parse tree and error messages. The interfaces 48of the Parsing class correspond to the two major passes of the parser, 49which are described below. 50 51## Prescanning and Preprocessing 52 53The first pass is performed by an instance of the Prescanner class, 54with help from an instance of Preprocessor. 55 56The prescanner generates the "cooked character stream", implemented 57by a CookedSource class instance, in which: 58* line ends have been normalized to single ASCII LF characters (UNIX newlines) 59* all `INCLUDE` files have been expanded 60* all continued Fortran source lines have been unified 61* all comments and insignificant spaces have been removed 62* fixed form right margins have been clipped 63* extra blank card columns have been inserted into character literals 64 and Hollerith constants 65* preprocessing directives have been implemented 66* preprocessing macro invocations have been expanded 67* legacy `D` lines in fixed form source have been omitted or included 68* except for the payload in character literals, Hollerith constants, 69 and character and Hollerith edit descriptors, all letters have been 70 normalized to lower case 71* all original non-ASCII characters in Hollerith constants have been 72 decoded and re-encoded into UTF-8 73 74Lines in the cooked character stream can be of arbitrary length. 75 76The purpose of the cooked character stream is to enable the implementation 77of a parser whose sole concern is the recognition of the Fortran language 78from productions that closely correspond to the grammar that is presented 79in the Fortran standard, without having to deal with the complexity of 80all of the source-level concerns in the preceding list. 81 82The implementation of the preprocessor interacts with the prescanner by 83means of _token sequences_. These are partitionings of input lines into 84contiguous virtual blocks of characters, and are the only place in this 85Fortran compiler in which we have reified a tokenization of the program 86source; the parser proper does not have a tokenizer. The prescanner 87builds these token sequences out of source lines and supplies them 88to the preprocessor, which interprets directives and expands macro 89invocations. The token sequences returned by the preprocessor are then 90marshaled to constitute the cooked character stream that is the output of 91the prescanner. 92 93The preprocessor and prescanner can both instantiate new temporary 94instances of the Prescanner class to locate, open, and process any 95include files. 96 97The tight interaction and mutual design of the prescanner and preprocessor 98enable a principled implementation of preprocessing for the Fortran 99language that implements a reasonable facsimile of the C language 100preprocessor that is fully aware of Fortran's source forms, line 101continuation mechanisms, case insensitivity, token syntax, &c. 102 103The preprocessor always runs. There's no good reason for it not to. 104 105The content of the cooked character stream is available and useful 106for debugging, being as it is a simple value forwarded from the first major 107pass of the compiler to the second. 108 109## Source Provenance 110 111The prescanner constructs a chronicle of every file that is read by the 112parser, viz. the original source file and all others that it directly 113or indirectly includes. One copy of the content of each of these files 114is mapped or read into the address space of the parser. Memory mapping 115is used initially, but files with DOS line breaks or a missing terminal 116newline are immediately normalized in a buffer when necessary. 117 118The virtual input stream, which marshals every appearance of every file 119and every expansion of every macro invocation, is not materialized as 120an actual stream of bytes. There is, however, a mapping from each byte 121position in this virtual input stream back to whence it came (maintained 122by an instance of the AllSources class). Offsets into this virtual input 123stream constitute values of the Provenance class. Provenance values, 124and contiguous ranges thereof, are used to describe and delimit source 125positions for messaging. 126 127Further, every byte in the cooked character stream supplied by the 128prescanner to the parser can be inexpensively mapped to its provenance. 129Simple `const char *` pointers to characters in the cooked character 130stream, or to contiguous ranges thereof, are used as source position 131indicators within the parser and in the parse tree. 132 133## Messages 134 135Message texts, and snprintf-like formatting strings for constructing 136messages, are instantiated in the various components of the parser with 137C++ user defined character literals tagged with `_err_en_US`, `_warn_en_US`, 138`port_en_US`, `because_en_US`, `todo_en_US`, and `_en_US` to signify severity 139and language. 140The default language is the dialect of English used in the United States. 141 142All "fatal" errors that do not immediately abort compilation but do 143prevent the generation of binary and module files are `_err_en_US`. 144Warnings about detected flaws in the program that probably indicate 145problems worth attention are `_warn_en_US`. 146Non-conforming extensions, legacy features, and obsolescent or deleted 147features will raise `_port_en_US` messages when those are enabled. 148Messages that are explanatory attachments to others are `_because_en_US`. 149Messages signifying an incomplete compiler feature are `_todo_en_US`. 150Other messages have a simple `_en_US` suffix. 151 152As described above, messages are associated with 153source code positions by means of provenance values. 154 155## The Parse Tree 156 157Each of the ca. 450 numbered requirement productions in the standard 158Fortran language grammar, as well as the productions implied by legacy 159extensions and preserved obsolescent features, maps to a distinct class 160in the parse tree so as to maximize the efficacy of static type checking 161by the C++ compiler. 162 163A transcription of the Fortran grammar appears with production requirement 164numbers in the commentary before these class definitions, so that one 165may easily refer to the standard (or to the parse tree definitions while 166reading that document). 167 168Three paradigms collectively implement most of the parse tree classes: 169* *wrappers*, in which a single data member `v` has been encapsulated 170 in a new type 171* *tuples* (or product types), in which several values of arbitrary 172 types have been encapsulated in a single data member `t` whose type 173 is an instance of `std::tuple<>` 174* *discriminated unions* (or sum types), in which one value whose type is 175 a dynamic selection from a set of distinct types is saved in a data 176 member `u` whose type is an instance of `std::variant<>` 177 178The use of these patterns is a design convenience, and exceptions to them 179are not uncommon wherever it made better sense to write custom definitions. 180 181Parse tree entities should be viewed as values, not objects; their 182addresses should not be abused for purposes of identification. They are 183assembled with C++ move semantics during parse tree construction. 184Their default and copy constructors are deliberately deleted in their 185declarations. 186 187The std::list<> data type is used in the parse tree to reliably store pointers 188to other relevant entries in the tree. Since the tree lists are moved and 189spliced at certain points std::list<> provides the necessary guarantee of the 190stability of pointers into these lists. 191 192There is a general purpose library by means of which parse trees may 193be traversed. 194 195## Parsing 196 197This compiler attempts to recognize the entire cooked character stream 198(see above) as a Fortran program. It records the reductions made during 199a successful recognition as a parse tree value. The recognized grammar 200is that of a whole source file, not just of its possible statements, 201so the parser has no global state that tracks the subprogram hierarchy 202or the structure of their nested block constructs. The parser performs 203no semantic analysis along the way, deferring all of that work to the 204next pass of the compiler. 205 206The resulting parse tree therefore necessarily contains ambiguous parses 207that cannot be resolved without recourse to a symbol table. Most notably, 208leading assignments to array elements can be misrecognized as statement 209function definitions, and array element references can be misrecognized 210as function calls. The semantic analysis phase of the compiler performs 211local rewrites of the parse tree once it can be disambiguated by symbols 212and types. 213 214Formally speaking, this parser is based on recursive descent with 215localized backtracking (specifically, it will not backtrack into a 216successful reduction to try its other alternatives). It is not generated 217as a table or code from a specification of the Fortran grammar; rather, it 218_is_ the grammar, as declaratively respecified in C++ constant expressions 219using a small collection of basic token recognition objects and a library 220of "parser combinator" template functions that compose them to form more 221complicated recognizers and their correspondences to the construction 222of parse tree values. 223 224## Unparsing 225 226Parse trees can be converted back into free form Fortran source code. 227This formatter is not really a classical "pretty printer", but is 228more of a data structure dump whose output is suitable for compilation 229by another compiler. It is also used for testing the parser, since a 230reparse of an unparsed parse tree should be an identity function apart from 231source provenance. 232