1<!--===- docs/ParserCombinators.md 2 3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 4 See https://llvm.org/LICENSE.txt for license information. 5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception 6 7--> 8 9# Parser Combinators 10 11```{contents} 12--- 13local: 14--- 15``` 16 17This document is a primer on Parser Combinators and their use in Flang. 18 19## Concept 20The Fortran language recognizer here can be classified as an LL recursive 21descent parser. It is composed from a *parser combinator* library that 22defines a few fundamental parsers and a few ways to compose them into more 23powerful parsers. 24 25For our purposes here, a *parser* is any object that attempts to recognize 26an instance of some syntax from an input stream. It may succeed or fail. 27On success, it may return some semantic value to its caller. 28 29In C++ terms, a parser is any instance of a class that 301. has a `constexpr` default constructor, 311. defines a type named `resultType`, and 321. provides a function (`const` member or `static`) that accepts a reference to a 33`ParseState` as its argument and returns a `std::optional<resultType>` as a 34result, with the presence or absence of a value in the `std::optional<>` 35signifying success or failure, respectively. 36``` 37std::optional<resultType> Parse(ParseState &) const; 38``` 39The `resultType` of a parser is typically the class type of some particular 40node type in the parse tree. 41 42`ParseState` is a class that encapsulates a position in the source stream, 43collects messages, and holds a few state flags that determine tokenization 44(e.g., are we in a character literal?). Instances of `ParseState` are 45independent and complete -- they are cheap to duplicate whenever necessary to 46implement backtracking. 47 48The `constexpr` default constructor of a parser is important. The functions 49(below) that operate on instances of parsers are themselves all `constexpr`. 50This use of compile-time expressions allows the entirety of a recursive 51descent parser for a language to be constructed at compilation time through 52the use of templates. 53 54### Fundamental Predefined Parsers 55These objects and functions are (or return) the fundamental parsers: 56 57* `ok` is a trivial parser that always succeeds without advancing. 58* `pure(x)` returns a trivial parser that always succeeds without advancing, 59 returning some value `x`. 60* `pure<T>()` is `pure(T{})` but does not require that T be copy-constructible. 61* `fail<T>(msg)` denotes a trivial parser that always fails, emitting the 62 given message as a side effect. The template parameter is the type of 63 the value that the parser never returns. 64* `nextCh` consumes the next character and returns its location, 65 and fails at EOF. 66* `"xyz"_ch` succeeds if the next character consumed matches any of those 67 in the string and returns its location. Be advised that the source 68 will have been normalized to lower case (miniscule) letters outside 69 character and Hollerith literals and edit descriptors before parsing. 70 71### Combinators 72These functions and operators combine existing parsers to generate new parsers. 73They are `constexpr`, so they should be viewed as type-safe macros. 74 75* `!p` succeeds if p fails, and fails if p succeeds. 76* `p >> q` fails if p does, otherwise running q and returning its value when 77 it succeeds. 78* `p / q` fails if p does, otherwise running q and returning p's value 79 if q succeeds. 80* `p || q` succeeds if p does, otherwise running q. The two parsers must 81 have the same type, and the value returned by the first succeeding parser 82 is the value of the combination. 83* `first(p1, p2, ...)` returns the value of the first parser that succeeds. 84 All of the parsers in the list must return the same type. 85 It is essentially the same as `p1 || p2 || ...` but has a slightly 86 faster implementation and may be easier to format in your code. 87* `lookAhead(p)` succeeds if p does, but doesn't modify any state. 88* `attempt(p)` succeeds if p does, safely preserving state on failure. 89* `many(p)` recognizes a greedy sequence of zero or more nonempty successes 90 of p, and returns `std::list<>` of their values. It always succeeds. 91* `some(p)` recognized a greedy sequence of one or more successes of p. 92 It fails if p immediately fails. 93* `skipMany(p)` is the same as `many(p)`, but it discards the results. 94* `maybe(p)` tries to match p, returning an `std::optional<T>` value. 95 It always succeeds. 96* `defaulted(p)` matches p, and when p fails it returns a 97 default-constructed instance of p's resultType. It always succeeds. 98* `nonemptySeparated(p, q)` repeatedly matches "p q p q p q ... p", 99 returning a `std::list<>` of only the values of the p's. It fails if 100 p immediately fails. 101* `extension<feature>([msg,]p)` parses p if strict standard compliance is 102 disabled, or with an optional warning when nonstandard usage warnings 103 are enabled. 104* `deprecated(p)` parses p if strict standard compliance is disabled, 105 with a warning if deprecated usage warnings are enabled. 106* `inContext(msg, p)` runs p within an error message context; any 107 message that `p` generates will be tagged with `msg` as its 108 context. Contexts may nest. 109* `withMessage(msg, p)` succeeds if `p` does, and if it does not, 110 it discards the messages from `p` and fails with the specified message. 111* `recovery(p, q)` is equivalent to `p || q`, except that error messages 112 generated from the first parser are retained, and a flag is set in 113 the ParseState to remember that error recovery was necessary. 114* `localRecovery(msg, p, q)` is equivalent to 115 `recovery(withMessage(msg, p), q >> pure<A>())` where `A` is the 116 result type of 'p'. 117 It is useful for targeted error recovery situations within statements. 118 119Note that 120``` 121a >> b >> c / d / e 122``` 123matches a sequence of five parsers, but returns only the result that was 124obtained by matching `c`. 125 126### Applicatives 127The following *applicative* combinators combine parsers and modify or 128collect the values that they return. 129 130* `construct<T>(p1, p2, ...)` matches zero or more parsers in succession, 131 collecting their results and then passing them with move semantics to a 132 constructor for the type T if they all succeed. 133 If there is a single parser as the argument and it returns no usable 134 value but only success or failure (_e.g.,_ `"IF"_tok`), the default 135 nullary constructor of the type `T` is called. 136* `sourced(p)` matches p, and fills in its `source` data member with the 137 locations of the cooked character stream that it consumed 138* `applyFunction(f, p1, p2, ...)` matches one or more parsers in succession, 139 collecting their results and passing them as rvalue reference arguments to 140 some function, returning its result. 141* `applyLambda([](&&x){}, p1, p2, ...)` is the same thing, but for lambdas 142 and other function objects. 143* `applyMem(mf, p1, p2, ...)` is the same thing, but invokes a member 144 function of the result of the first parser. 145 146### Token Parsers 147Last, we have these basic parsers on which the actual grammar of the Fortran 148is built. All of the following parsers consume characters acquired from 149`nextCh`. 150 151* `space` always succeeds after consuming any spaces 152* `spaceCheck` always succeeds after consuming any spaces, and can emit 153 a warning if there was no space in free form code before a character 154 that could continue a name or keyword 155* `digit` matches one cooked decimal digit (0-9) 156* `letter` matches one cooked letter (A-Z) 157* `"..."_tok` match the content of the string, skipping spaces before and 158 after. Internal spaces are optional matches. The `_tok` suffix is 159 optional when the parser appears before the combinator `>>` or after 160 the combinator `/`. If the quoted string ends in a character that 161 could appear in an identifier, a missing space will be diagnosed in 162 free form source in pedantic mode if the next character could also 163 be part of an identifier -- add a trailing blank to avoid this. 164* `"..."_sptok` is a string match in which the spaces are required in 165 free form source. 166* `"..."_id` is a string match for a complete identifier (not a prefix of 167 a longer identifier or keyword). 168* `parenthesized(p)` is shorthand for `"(" >> p / ")"`. 169* `bracketed(p)` is shorthand for `"[" >> p / "]"`. 170* `nonemptyList(p)` matches a comma-separated list of one or more 171 instances of p. 172* `nonemptyList(errorMessage, p)` is equivalent to 173 `withMessage(errorMessage, nonemptyList(p))`, which allows one to supply 174 a meaningful error message in the event of an empty list. 175* `optionalList(p)` is the same thing, but can be empty, and always succeeds. 176 177### Debugging Parser 178Last, a string literal `"..."_debug` denotes a parser that emits the string to 179`llvm::errs` and succeeds. It is useful for tracing while debugging a parser but should 180obviously not be committed for production code. 181 182### Messages 183A list of generated error and warning messages is maintained in the `ParseState`. 184The parser combinator that handles alternatives (`||` and `first()`) will 185discard the messages from alternatives that fail when there is an alternative 186that succeeds. 187But when no alternative succeeds, and the alternative parser as a whole is 188failing, the messages that survive are chosen from the alternative that 189recognized any input tokens, if only one alternative did so; 190and when multiple alternatives recognized tokens, the messages from the 191alternative that proceeded the furthest into the input are retained. 192This strategy tends to show the most useful error messages to the user 193in situations where a statement fails to parse. 194