This is cppinternals.info, produced by makeinfo version 6.8 from
cppinternals.texi.

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* Cpplib: (cppinternals).      Cpplib internals.
END-INFO-DIR-ENTRY

This file documents the internals of the GNU C Preprocessor.

   Copyright (C) 2000-2022 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the entire resulting derived work is distributed under the terms of
a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions.


File: cppinternals.info,  Node: Top,  Next: Conventions,  Up: (dir)

The GNU C Preprocessor Internals
********************************

* Menu:

* Conventions::
* Lexer::
* Hash Nodes::
* Macro Expansion::
* Token Spacing::
* Line Numbering::
* Guard Macros::
* Files::
* Concept Index::

1 Cpplib--the GNU C Preprocessor
********************************

The GNU C preprocessor is implemented as a library, "cpplib", so it can
be easily shared between a stand-alone preprocessor, and a preprocessor
integrated with the C, C++ and Objective-C front ends.  It is also
available for use by other programs, though this is not recommended as
its exposed interface has not yet reached a point of reasonable
stability.

   The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

   This brief manual documents the internals of cpplib, and explains
some of the tricky issues.  It is intended that, along with the comments
in the source code, a reasonably competent C programmer should be able
to figure out what the code is doing, and why things have been
implemented the way they have.

* Menu:

* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.


File: cppinternals.info,  Node: Conventions,  Next: Lexer,  Prev: Top,  Up: Top

Conventions
***********

cpplib has two interfaces--one is exposed internally only, and the other
is for both internal and external use.

   The convention is that functions and types that are exposed to
multiple files internally are prefixed with '_cpp_', and are to be found
in the file 'internal.h'.  Functions and types exposed to external
clients are in 'cpplib.h', and prefixed with 'cpp_'.  For historical
reasons this is no longer quite true, but we should strive to stick to
it.

   We are striving to reduce the information exposed in 'cpplib.h' to
the bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.


File: cppinternals.info,  Node: Lexer,  Next: Hash Nodes,  Prev: Conventions,  Up: Top

The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.cc'.  It is a hand-coded lexer,
and not implemented as a state machine.  It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

   It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
'cpp_spell_token' and 'cpp_token_len'.  These functions are useful when
generating diagnostics, and for emitting the preprocessed output.

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines.  In its current form the code is quite complicated, with
read ahead characters and such-like, since it strives to not step back
in the character stream in preparation for handling non-ASCII file
encodings.
The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

   The job of '_cpp_lex_direct' is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor rôle to play in memory management
of lexed lines.  I discuss these issues in a separate section (*note
Lexing a line::).

   The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

   The lexer does not consider whitespace to be a token in its own
right.  If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags.  Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token.  This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.

   The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line.
This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it.  Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set.  The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.

   New lines are treated specially; exactly how the lexer handles them
is context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive.  In
a directive a 'CPP_EOF' token never means end-of-file.  Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.

   The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringized.  The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call.
It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2.  It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

   This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.

   The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator.  This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.

   We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline.  It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars.  Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.
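
   The treatment of '\r', '\n', '\r\n' and '\n\r' as a single new line
indicator can be illustrated with a minimal sketch.  The helpers
'newline_len' and 'count_newlines' below are hypothetical names invented
for this illustration; they are not cpplib functions and say nothing
about how 'lex.cc' is actually organized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a cpplib function: if S begins with a
   new line indicator, return its length in bytes (1 or 2), else 0.  */
static size_t
newline_len (const char *s)
{
  if (s[0] != '\n' && s[0] != '\r')
    return 0;
  /* A CR-LF or LF-CR pair counts as one indicator; a repeated
     character such as LF-LF is two separate new lines.  */
  if ((s[1] == '\n' || s[1] == '\r') && s[1] != s[0])
    return 2;
  return 1;
}

/* Count the new line indicators in the NUL-terminated string S.  */
static unsigned int
count_newlines (const char *s)
{
  unsigned int count = 0;

  assert (s != NULL);
  while (*s != '\0')
    {
      size_t len = newline_len (s);
      if (len != 0)
        {
          count++;
          s += len;
        }
      else
        s++;
    }
  return count;
}
```

   With this pairing rule a file that mixes Unix, MS-DOS and old
Macintosh line endings is counted consistently, which is the property
the paragraph above describes.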

   Handling newlines like this is made simpler by doing it in one place
only.  The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminates a comment.  Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.

   So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline.  If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.
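
   The difficulty described for 'parse_number' can be sketched as
follows.  This is an illustration under simplifying assumptions, not the
cpplib code: the functions 'escaped_newline_len' and 'scan_digits' are
hypothetical, only decimal digits are recognized, only '\n' is treated
as a newline, and no whitespace is allowed between the backslash and the
newline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the length of one escaped newline at P
   (a backslash, or the trigraph spelling of a backslash, followed
   directly by a newline), or 0 if there is none.  */
static size_t
escaped_newline_len (const char *p)
{
  size_t n;

  if (p[0] == '\\')
    n = 1;
  else if (p[0] == '?' && p[1] == '?' && p[2] == '/')
    n = 3;
  else
    return 0;
  return p[n] == '\n' ? n + 1 : 0;
}

/* Copy the run of decimal digits starting at P into OUT (capacity CAP,
   always NUL-terminated), stepping over escaped newlines wherever they
   occur.  Return a pointer to the first character not consumed.  */
static const char *
scan_digits (const char *p, char *out, size_t cap)
{
  size_t used = 0;

  assert (out != NULL && cap > 0);
  for (;;)
    {
      size_t skip;

      /* There may be an arbitrarily long sequence of escaped newlines.  */
      while ((skip = escaped_newline_len (p)) != 0)
        p += skip;
      if (*p < '0' || *p > '9')
        break;
      if (used + 1 < cap)
        out[used++] = *p;
      p++;
    }
  out[used] = '\0';
  return p;
}
```

   The key point is that the check for "end of number" happens only
after any escaped newlines have been skipped, so a '\' or '?' in the
input does not by itself terminate the token.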

   Similarly, code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

   Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic.  However, if they appear in a macro
expansion we don't want to complain with each use of the macro.  It is
therefore best to catch them during the lexing stage, in
'parse_identifier'.  In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state.  For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro.  Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate.  Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.

   Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, it should be lexed as a single
token as far as the nearest '>' character.  Note that we don't allow the
terminators of header names to be escaped; the first '"' or '>'
terminates the header name.

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error.  Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.

   Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'.  The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

   Occasionally the preprocessor wants to be able to peek ahead in the
token stream.
For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

   The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do _not_ lex a line of tokens at
once; if we did that, 'parse_identifier' would not have state flags
available to warn about invalid identifiers (*note Invalid
identifiers::).

   In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.
In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

   Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

   Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run.  Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.

   The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'.  So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like '__LINE__', and the '#' and '##' operators for stringizing
and token pasting.  I handled this by allocating space for these tokens
from the lexer's token run chain.  This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

   Lexing into a line of tokens solves some of the token memory
management issues, but not all.
The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line.  So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro.  In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

   The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made.  '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.
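
   The token run scheme described in this section can be sketched as
below.  This is a simplified model, not the cpplib data structures: the
types 'tok', 'run' and 'lexer', the constant 'RUN_SIZE' and all function
names are assumptions made for illustration.  Only the ideas of chaining
a new run on overflow instead of reallocating, and of rewinding at end
of line only when 'keep_tokens' is zero, are taken from the text.

```c
#include <assert.h>
#include <stdlib.h>

enum { RUN_SIZE = 4 };          /* deliberately tiny for the demo */

struct tok { int value; };

struct run
{
  struct run *next;             /* run chained on after an overflow */
  struct tok slots[RUN_SIZE];
};

struct lexer
{
  struct run *first;            /* head of the run chain */
  struct run *cur;              /* run currently being filled */
  size_t used;                  /* slots handed out from CUR */
  unsigned int keep_tokens;     /* nonzero: do not rewind at end of line */
};

static struct run *
new_run (void)
{
  struct run *r = calloc (1, sizeof *r);
  assert (r != NULL);
  return r;
}

static void
lexer_init (struct lexer *lx)
{
  lx->first = lx->cur = new_run ();
  lx->used = 0;
  lx->keep_tokens = 0;
}

/* Return storage for the next token.  Existing runs are never moved or
   reallocated, so pointers handed out earlier stay valid.  */
static struct tok *
next_slot (struct lexer *lx)
{
  if (lx->used == RUN_SIZE)
    {
      if (lx->cur->next == NULL)
        lx->cur->next = new_run ();
      lx->cur = lx->cur->next;
      lx->used = 0;
    }
  return &lx->cur->slots[lx->used++];
}

/* At an unescaped new line, start reusing the run from the beginning,
   unless someone has asked for tokens to be kept.  */
static void
end_of_line (struct lexer *lx)
{
  if (lx->keep_tokens == 0)
    {
      lx->cur = lx->first;
      lx->used = 0;
    }
}

static void
lexer_free (struct lexer *lx)
{
  struct run *r = lx->first;
  while (r != NULL)
    {
      struct run *next = r->next;
      free (r);
      r = next;
    }
  lx->first = lx->cur = NULL;
}
```

   In this model, a caller that needs to look past the end of the line
increments 'keep_tokens' before doing so and decrements it afterwards,
which mirrors the role the text gives that variable.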


File: cppinternals.info,  Node: Hash Nodes,  Next: Macro Expansion,  Prev: Lexer,  Up: Top

Hash Nodes
**********

When cpplib encounters an "identifier", it generates a hash code for it
and stores it in the hash table.  By "identifier" we mean tokens with
type 'CPP_NAME'; this includes identifiers in the usual C sense, as well
as keywords, directive names, macro names and so on.  For example, all
of 'pragma', 'int', 'foo' and '__GNUC__' are identifiers and hashed when
lexed.

   Each node in the hash table contains various information about the
identifier it represents, for example its length and type.  At any one
time, each identifier falls into exactly one of three categories:

   * Macros

     These have been declared to be macros, either on the command line
     or with '#define'.  A few, such as '__TIME__', are built-ins entered
     in the hash table during initialization.  The hash node for a
     normal macro points to a structure with more information about the
     macro, such as whether it is function-like, how many arguments it
     takes, and its expansion.  Built-in macros are flagged as special,
     and instead contain an enum indicating which of the various
     built-in macros it is.

   * Assertions

     Assertions are in a separate namespace to macros.  To enforce this,
     cpp actually prepends a '#' character before hashing and entering
     it in the hash table.  An assertion's node points to a chain of
     answers to that assertion.

   * Void

     Everything else falls into this category--an identifier that is not
     currently a macro, or a macro that has since been undefined with
     '#undef'.

     When preprocessing C++, this category also includes the named
     operators, such as 'xor'.  In expressions these behave like the
     operators they represent, but in contexts where the spelling of a
     token matters they are spelt differently.  This spelling
     distinction is relevant when they are operands of the stringizing
     and pasting macro operators '#' and '##'.  Named operator hash
     nodes are flagged, both to catch the spelling distinction and to
     prevent them from being defined as macros.

   The same identifiers share the same hash node.  Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information.  For example, when
parsing a '#define' statement, CPP flags each argument's identifier hash
node with the index of that argument.  This makes duplicated argument
checking an O(1) operation for each argument.  Similarly, for each
identifier in the macro's expansion, lookup to see if it is an argument,
and which argument it is, is also an O(1) operation.  Further, each
directive name, such as 'endif', has an associated directive enum stored
in its hash node, so that directive lookup is also O(1).
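
   The O(1) duplicate-argument check can be illustrated with a small
interning table.  This is a sketch under stated assumptions: the 'node'
structure, the fixed-size chained table, the hash function and the
functions 'intern', 'mark_params' and 'clear_params' are all invented
for this example and do not reflect the layout of cpplib's real hash
nodes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { TABLE_SIZE = 64 };

struct node
{
  struct node *chain;           /* next node in the same bucket */
  char *name;
  int arg_index;                /* 1-based parameter index, 0 if none */
};

static struct node *table[TABLE_SIZE];

static unsigned int
hash_name (const char *s)
{
  unsigned int h = 5381;
  while (*s != '\0')
    h = h * 33 + (unsigned char) *s++;
  return h % TABLE_SIZE;
}

/* Return the unique node for NAME, creating it on first use, so that
   equal identifiers always share one node.  */
static struct node *
intern (const char *name)
{
  unsigned int h = hash_name (name);
  size_t len = strlen (name) + 1;
  struct node *n;

  for (n = table[h]; n != NULL; n = n->chain)
    if (strcmp (n->name, name) == 0)
      return n;

  n = malloc (sizeof *n);
  assert (n != NULL);
  n->name = malloc (len);
  assert (n->name != NULL);
  memcpy (n->name, name, len);
  n->arg_index = 0;
  n->chain = table[h];
  table[h] = n;
  return n;
}

/* Flag each parameter's node with its index.  Return 0 on success, or
   the 1-based position of the first duplicated parameter.  Given the
   node, the duplicate test is a single field read.  */
static int
mark_params (const char *const *params, int count)
{
  for (int i = 0; i < count; i++)
    {
      struct node *n = intern (params[i]);
      if (n->arg_index != 0)
        return i + 1;
      n->arg_index = i + 1;
    }
  return 0;
}

/* Remove the flags again once the definition has been processed.  */
static void
clear_params (const char *const *params, int count)
{
  for (int i = 0; i < count; i++)
    intern (params[i])->arg_index = 0;
}
```

   The same 'arg_index' field answers "is this identifier a parameter,
and which one?" while scanning the replacement list, which is the
second O(1) lookup mentioned above.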


File: cppinternals.info,  Node: Macro Expansion,  Next: Token Spacing,  Prev: Hash Nodes,  Up: Top

Macro Expansion Algorithm
*************************

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to optimize
the preprocessor's expansion algorithm wrong in quite subtle ways.

   I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this section,
let alone the code!  If you don't have a clear mental picture of how
things like nested macro expansion, stringizing and token pasting are
supposed to work, damage to your sanity can quickly result.

Internal representation of macros
=================================

The preprocessor stores macro expansions in tokenized form.  This saves
repeated lexing passes during expansion, at the cost of a small increase
in memory consumption on average.  The tokens are stored contiguously in
memory, so a pointer to the first one and a token count are all you need
to get the replacement list of a macro.

   If the macro is a function-like macro the preprocessor also stores
its parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier.  Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type 'CPP_MACRO_ARG'.
Each such token holds the index
of the parameter it represents in the parameter list, which allows rapid
replacement of parameters with their arguments during expansion.
Despite this optimization it is still necessary to store the original
parameters to the macro, both for dumping with e.g., '-dD', and to warn
about non-trivial macro redefinitions when the parameter names have
changed.

Macro expansion overview
========================

The preprocessor maintains a "context stack", implemented as a linked
list of 'cpp_context' structures, which together represent the macro
expansion state at any one time.  The 'struct cpp_reader' member
variable 'context' points to the current top of this stack.  The top
normally holds the unexpanded replacement list of the innermost macro
under expansion, except when cpplib is about to pre-expand an argument,
in which case it holds that argument's unexpanded tokens.

   When there are no macros under expansion, cpplib is in "base
context".  All contexts other than the base context contain a contiguous
list of tokens delimited by a starting and ending token.  When not in
base context, cpplib obtains the next token from the list of the top
context.  If there are no tokens left in the list, it pops that context
off the stack, and subsequent ones if necessary, until an unexhausted
context is found or it returns to base context.  In base context, cpplib
reads tokens directly from the lexer.
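
   The token-fetching rule just described can be modelled in a few
lines.  This is a sketch only: the 'context' and 'reader' structures
here are simplified stand-ins invented for the example (tokens are plain
strings, and a NULL-terminated array plays the part of the lexer), not
the real 'cpp_context' and 'cpp_reader' layouts.

```c
#include <assert.h>
#include <stddef.h>

struct context
{
  struct context *prev;         /* next context down; NULL above base */
  const char *const *cur;       /* next token to hand out */
  const char *const *end;       /* one past the last token */
};

struct reader
{
  struct context *top;          /* NULL means base context */
  const char *const *lexer;     /* NULL-terminated stand-in for the lexer */
};

/* Push caller-provided context C, holding the N tokens at TOKS.  */
static void
push_context (struct reader *r, struct context *c,
              const char *const *toks, size_t n)
{
  assert (r != NULL && c != NULL);
  c->cur = toks;
  c->end = toks + n;
  c->prev = r->top;
  r->top = c;
}

/* Return the next token, or NULL once the input is exhausted.  */
static const char *
get_token (struct reader *r)
{
  while (r->top != NULL)
    {
      if (r->top->cur < r->top->end)
        return *r->top->cur++;
      /* Exhausted: pop, and retry with the context underneath.  */
      r->top = r->top->prev;
    }
  /* Base context: read directly from the "lexer".  */
  return *r->lexer != NULL ? *r->lexer++ : NULL;
}
```

   Note that in this model a context is popped only when the following
token is requested, not at the moment its last token is returned.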

   If it encounters an identifier that is both a macro and enabled for
expansion, cpplib prepares to push a new context for that macro on the
stack by calling the routine 'enter_macro_context'.  When this routine
returns, the new context will contain the unexpanded tokens of the
replacement list of that macro.  In the case of function-like macros,
'enter_macro_context' also replaces any parameters in the replacement
list, stored as 'CPP_MACRO_ARG' tokens, with the appropriate macro
argument.  If the standard requires that the parameter be replaced with
its expanded argument, the argument will have been fully macro expanded
first.

   'enter_macro_context' also handles special macros like '__LINE__'.
Although these macros expand to a single token which cannot contain any
further macros, for reasons of token spacing (*note Token Spacing::) and
simplicity of implementation, cpplib handles these special macros by
pushing a context containing just that one token.

   The final thing that 'enter_macro_context' does before returning is
to mark the macro disabled for expansion (except for special macros like
'__TIME__').  The macro is re-enabled when its context is later popped
from the context stack, as described above.  This strict ordering
ensures that a macro is disabled whilst its expansion is being scanned,
but that it is _not_ disabled whilst any arguments to it are being
expanded.
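
   The disable-on-push, re-enable-on-pop discipline can be demonstrated
with a toy expander for object-like macros.  Everything here is an
assumption made for illustration: tokens are strings, macros live in a
small array searched linearly, the context stack is a fixed-size array,
and the names 'macro', 'context', 'lookup', 'append' and 'expand' are
invented.  Function-like macros, argument pre-expansion and special
macros are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_DEPTH = 8 };

struct macro
{
  const char *name;
  const char *const *body;      /* replacement list, one string per token */
  size_t count;                 /* number of tokens in BODY */
  int disabled;                 /* nonzero while its context is stacked */
};

struct context
{
  struct macro *macro;          /* macro whose replacement list this is */
  size_t pos;                   /* index of the next token to hand out */
};

static struct macro *
lookup (struct macro *macros, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (macros[i].name, name) == 0)
      return &macros[i];
  return NULL;
}

/* Append TOK to OUT, space-separated, truncating if OUT is full.  */
static void
append (char *out, size_t cap, const char *tok)
{
  size_t len = strlen (out);
  if (len > 0 && len + 1 < cap)
    {
      out[len++] = ' ';
      out[len] = '\0';
    }
  strncat (out, tok, cap - len - 1);
}

/* Expand the NIN tokens at IN into OUT (capacity CAP).  */
static void
expand (struct macro *macros, size_t n, const char *const *in,
        size_t nin, char *out, size_t cap)
{
  struct context stack[MAX_DEPTH];
  size_t depth = 0, next = 0;

  assert (cap > 0);
  out[0] = '\0';
  for (;;)
    {
      const char *tok;
      struct macro *m;

      /* Pop exhausted contexts, re-enabling their macros.  This only
         happens when the next token is wanted.  */
      while (depth > 0
             && stack[depth - 1].pos == stack[depth - 1].macro->count)
        stack[--depth].macro->disabled = 0;

      if (depth > 0)
        {
          struct context *c = &stack[depth - 1];
          tok = c->macro->body[c->pos++];
        }
      else if (next < nin)
        tok = in[next++];
      else
        break;

      m = lookup (macros, n, tok);
      if (m != NULL && !m->disabled && depth < MAX_DEPTH)
        {
          /* Push a context for M, then mark M disabled.  */
          stack[depth].macro = m;
          stack[depth].pos = 0;
          depth++;
          m->disabled = 1;
          continue;
        }
      append (out, cap, tok);
    }
}
```

   With 'foo' defined as 'foo + bar' and 'bar' as '1', the input
'foo ; foo' comes out as 'foo + 1 ; foo + 1': the inner 'foo' is left
alone because its macro is disabled while its own replacement list is
being scanned, and the second 'foo' expands because the macro was
re-enabled when its context was popped.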

Scanning the replacement list for macros to expand
==================================================

The C standard states that, after any parameters have been replaced with
their possibly-expanded arguments, the replacement list is scanned for
nested macros. Further, any identifiers in the replacement list that
are not expanded during this scan are never again eligible for expansion
in the future, if the reason they were not expanded is that the macro in
question was disabled.

   Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion. Other tokens never have an opportunity to be
re-tested for expansion. It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan. This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token happens
to be an opening parenthesis (itself possibly the first token of an
argument).

   It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack. Only when
looking for the _next_ token do we pop it off the stack and drop to a
lower context.
This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current context
is still disabled when we are considering the last token of its
replacement list for expansion (or indeed expanding it). As an example,
which illustrates many of the points above, consider

     #define foo(x) bar x
     foo(foo) (2)

which fully expands to 'bar foo (2)'. During pre-expansion of the
argument, 'foo' does not expand even though the macro is enabled, since
it has no following parenthesis [pre-expansion of an argument only uses
tokens from that argument; it cannot take tokens from whatever follows
the macro invocation]. This still leaves the argument token 'foo'
eligible for future expansion. Then, when re-scanning after argument
replacement, the token 'foo' is rejected for expansion, and marked
ineligible for future expansion, since the macro is now disabled. It is
disabled because the replacement list 'bar foo' of the macro is still on
the context stack.

   If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro were disabled it would be subtly wrong.
In the example above, the replacement list of 'foo' would be popped in
the process of finding the parenthesis, re-enabling 'foo' and expanding
it a second time.

Looking for a function-like macro's opening parenthesis
=======================================================

Function-like macros only expand when immediately followed by a
parenthesis.
To do this cpplib needs to temporarily disable macros and
read the next token. Unfortunately, because of spacing issues (*note
Token Spacing::), there can be fake padding tokens in-between, and if
the next real token is not a parenthesis cpplib needs to be able to back
up that one token as well as retain the information in any intervening
padding tokens.

   Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which are too hard.
Instead, searching for the parenthesis is handled by a special function,
'funlike_invocation_p', which remembers padding information as it reads
tokens. If the next real token is not an opening parenthesis, it backs
up that one token, and then pushes an extra context just containing the
padding information if necessary.

Marking tokens ineligible for future expansion
==============================================

As discussed above, cpplib needs a way of marking tokens as
unexpandable. Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the flag
'NO_EXPAND' to the copy.

   For efficiency and to simplify memory management by avoiding having
to remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (*note Lexing a line::) using the
function '_cpp_temp_token'. The tokens are then re-used once the
current line of tokens has been read in.
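
   The copy-and-flag approach can be sketched as below. This is only an
illustration under simplifying assumptions: 'token_run' here is a
fixed-size array, the flag value is arbitrary, and the function names
are hypothetical rather than cpplib's.

```c
#include <assert.h>
#include <stddef.h>

#define NO_EXPAND 0x01u     /* flag value chosen for this sketch only */

typedef struct token {
    int id;
    unsigned flags;
} token;

#define RUN_CAPACITY 8

/* A toy "token run": a block of slots handed out in order. */
typedef struct token_run {
    token slots[RUN_CAPACITY];
    size_t used;
} token_run;

/* Hand out the next free slot in the run, or NULL if the run is full.
   (A fuller implementation would chain to another run instead.)  */
static token *temp_token(token_run *run)
{
    if (run->used == RUN_CAPACITY)
        return NULL;
    return &run->slots[run->used++];
}

/* Leave SRC untouched; return a flagged copy living in the run. */
static const token *copy_as_unexpandable(token_run *run, const token *src)
{
    token *copy = temp_token(run);
    if (copy == NULL)
        return NULL;
    *copy = *src;
    copy->flags |= NO_EXPAND;
    return copy;
}

/* Starting a new line makes every slot available again; nothing is freed. */
static void start_new_line(token_run *run)
{
    run->used = 0;
}
```

   Because the copies live in the run, there is no per-token free: the
slots are simply handed out again after 'start_new_line'.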

   This might sound unsafe. However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization is
safe.


File: cppinternals.info, Node: Token Spacing, Next: Line Numbering, Prev: Macro Expansion, Up: Top

Token Spacing
*************

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its
preprocessed output results in an identical token stream. Without
taking special measures, this might not be the case because of macro
substitution. For example:

     #define PLUS +
     #define EMPTY
     #define f(x) =x=
     +PLUS -EMPTY- PLUS+ f(=)
          ==> + + - - + + = = =
     _not_
          ==> ++ -- ++ ===

   One solution would be to simply insert a space between all adjacent
tokens. However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

   For now, just notice that when tokens are added (or removed, as shown
by the 'EMPTY' example) from the original lexed token stream, we need to
check for accidental token pasting. We call this "paste avoidance".
Token addition and removal can only occur because of macro expansion,
but accidental pasting can occur in many places: both before and after
each macro replacement, each argument replacement, and additionally each
token created by the '#' and '##' operators.

   Look at how the preprocessor gets whitespace output correct normally.
The 'cpp_token' structure contains a flags byte, and one of those flags
is 'PREV_WHITE'. This is flagged by the lexer, and indicates that the
token was preceded by whitespace of some form other than a new line.
The stand-alone preprocessor can use this flag to decide whether to
insert a space between tokens in the output.

   Now consider the result of the following macro expansion:

     #define add(x, y, z) x + y +z;
     sum = add (1,2, 3);
          ==> sum = 1 + 2 +3;

   The interesting thing here is that the tokens '1' and '2' are output
with a preceding space, and '3' is output without a preceding space, but
when lexed none of these tokens had that property. Careful
consideration reveals that '1' gets its preceding whitespace from the
space preceding 'add' in the macro invocation, _not_ from the
replacement list. '2' gets its whitespace from the space preceding the
parameter 'y' in the macro replacement list, and '3' has no preceding
space because parameter 'z' has none in the replacement list.

   Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions.
So instead of modifying the two tokens
above, the preprocessor inserts a special token, which I call a "padding
token", into the token stream to indicate that spacing of the subsequent
token is special. The preprocessor inserts padding tokens in front of
every macro expansion and expanded macro argument. These point to a
"source token" from which the subsequent real token should inherit its
spacing. In the above example, the source tokens are 'add' in the macro
invocation, and 'y' and 'z' in the macro replacement list, respectively.

   It is quite easy to get multiple padding tokens in a row, for example
if a macro's first replacement token expands straight into another
macro.

     #define foo bar
     #define bar baz
     [foo]
          ==> [baz]

   Here, two padding tokens are generated with sources the 'foo' token
between the brackets, and the 'bar' token from foo's replacement list,
respectively. Clearly the first padding token is the one to use, so the
output code should contain a rule that the first padding token in a
sequence is the one that matters.

   But what happens when we leave a macro expansion? Adjusting the
above example slightly:

     #define foo bar
     #define bar EMPTY baz
     #define EMPTY
     [foo] EMPTY;
          ==> [ baz] ;

   As shown, now there should be a space before 'baz' and the semicolon
in the output.
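
   The rule decided so far, that the first padding token in a sequence
determines the spacing of the next real token, can be sketched as
follows. The token layout, the flag value and the function
'print_tokens' are assumptions made for illustration; they are not
cpplib's output code. The sketch implements only this one rule, so it
handles the 'add' and '[foo]' examples but not the 'EMPTY' example just
shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PREV_WHITE 0x01u    /* flag value chosen for this sketch only */

typedef enum { TOK_REAL, TOK_PADDING } token_kind;

typedef struct token {
    token_kind kind;
    unsigned flags;
    const char *spelling;           /* used by TOK_REAL only */
    const struct token *source;     /* used by TOK_PADDING only */
} token;

/* Write the spelling of each real token to OUT (capacity CAP), taking
   spacing from the first padding token that precedes it, if any.  */
static char *print_tokens(const token *toks, size_t n, char *out, size_t cap)
{
    const token *source = NULL;     /* first padding source since the last real token */
    size_t len = 0;

    if (cap == 0)
        return out;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        const token *t = &toks[i];

        if (t->kind == TOK_PADDING) {
            if (source == NULL)
                source = t->source;     /* only the first in a row counts */
            continue;
        }

        /* With no padding in front, a token uses its own flag. */
        const token *spacing = source != NULL ? source : t;
        int space = (spacing->flags & PREV_WHITE) != 0;
        size_t slen = strlen(t->spelling);

        if (len + (size_t)space + slen + 1 > cap)
            break;                      /* out of room: stop early */
        if (space)
            out[len++] = ' ';
        memcpy(out + len, t->spelling, slen + 1);
        len += slen;
        source = NULL;
    }
    return out;
}
```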

   The rules we decided above fail for 'baz': we generate three padding
tokens, one per macro invocation, before the token 'baz'. We would then
have it take its spacing from the first of these, which carries source
token 'foo' with no leading space.

   It is vital that cpplib get spacing correct in these examples since
any of these macro expansions could be stringized, where spacing
matters.

   So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too. I made
cpplib insert a padding token with a 'NULL' source token when leaving
macro expansions, as well as after each replaced argument in a macro's
replacement list. It also inserts appropriate padding tokens on either
side of tokens created by the '#' and '##' operators. I expanded the
rule so that, if we see a padding token with a 'NULL' source token,
_and_ the source token recorded so far has no leading space, then we
behave as if we have seen no padding tokens at all. A quick check shows
this rule will then get the above example correct as well.

   Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct. This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function 'cpp_avoid_paste' advises whether a space is required
between two consecutive tokens. To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.


File: cppinternals.info, Node: Line Numbering, Next: Guard Macros, Prev: Token Spacing, Up: Top

Line numbering
**************

Just which line number anyway?
==============================

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

   * The source line it was lexed on.
   * The line it is output on. This can be different to the line it was
     lexed on if, for example, there are intervening escaped newlines or
     C-style comments. For example:

          foo /* A long
          comment */ bar \
          baz
          =>
          foo bar baz

   * If the token results from a macro expansion, the line of the macro
     name, or possibly the line of the closing parenthesis in the case
     of function-like macro expansion.

   The 'cpp_token' structure contains 'line' and 'col' members. The
lexer fills these in with the line and column of the first character of
the token.
Consequently, but maybe unexpectedly, a token from the
replacement list of a macro expansion carries the location of the token
within the '#define' directive, because cpplib expands a macro by
returning pointers to the tokens in its replacement list. The current
implementation of cpplib assigns tokens created from built-in macros and
the '#' and '##' operators the location of the most recently lexed
token. This is because they are allocated from the lexer's token
runs, and because of the way the diagnostic routines infer the
appropriate location to report.

   The diagnostic routines in cpplib display the location of the most
recently _lexed_ token, unless they are passed a specific line and
column to report. For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.

   The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion. All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

   To solve these issues, cpplib provides a callback that is invoked
whenever it lexes a preprocessing token that starts a new logical line
other than a directive. It passes this token (which may be a 'CPP_EOF'
token indicating the end of the translation unit) to the callback
routine, which can then use the line and column of this token to produce
correct output.

Representation of line numbers
==============================

As mentioned above, cpplib stores with each token the line number that
it was lexed on. In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

   The preprocessor maintains a monotonic increasing line count, which
is incremented at every new line character (and also at the end of any
buffer that does not end in a new line). Since a line number of zero is
useful to indicate certain special states and conditions, this variable
starts counting from one.

   This variable therefore uniquely enumerates each line in the
translation unit. With some simple infrastructure, it is
straightforward to map from this to the original source file and line
number pair, saving space whenever line number information needs to be
saved. The code that implements this mapping lies in the files
'line-map.cc' and 'line-map.h'.
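
   One way such a mapping could work is sketched below: a sorted table
records, for each run of consecutive lines, where the run starts in the
translation unit and which source file and line that corresponds to.
The structures and 'lookup_line' are hypothetical and are not the
interface of 'line-map.h'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry covers translation-unit lines from FROM_LINE up to the
   next entry's FROM_LINE (or to the end, for the last entry).  */
typedef struct line_map_entry {
    unsigned from_line;     /* first translation-unit line covered */
    const char *file;       /* source file those lines came from */
    unsigned to_line;       /* source-file line matching FROM_LINE */
} line_map_entry;

typedef struct source_pos {
    const char *file;
    unsigned line;
} source_pos;

/* ENTRIES must be sorted by FROM_LINE, and LINE must not precede the
   first entry.  Binary search for the last entry starting at or
   before LINE.  */
static source_pos lookup_line(const line_map_entry *entries, size_t n, unsigned line)
{
    size_t lo = 0, hi = n;  /* invariant: entries[lo].from_line <= line */

    assert(n > 0 && entries[0].from_line <= line);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].from_line <= line)
            lo = mid;
        else
            hi = mid;
    }

    source_pos pos = {
        entries[lo].file,
        entries[lo].to_line + (line - entries[lo].from_line)
    };
    return pos;
}
```

   A token then needs to store only the single translation-unit line
number; the file name and source line are recovered on demand.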

   Command-line macros and assertions are implemented by pushing a
buffer containing the right hand side of an equivalent '#define' or
'#assert' directive. Some built-in macros are handled similarly. Since
these are all processed before the first line of the main input file, it
will typically have an assigned line closer to twenty than to one.


File: cppinternals.info, Node: Guard Macros, Next: Files, Prev: Line Numbering, Up: Top

The Multiple-Include Optimization
*********************************

Header files are often of the form

     #ifndef FOO
     #define FOO
     ...
     #endif

to prevent the compiler from processing them more than once. The
preprocessor notices such header files, so that if the header file
appears in a subsequent '#include' directive and 'FOO' is defined, then
it is ignored and it doesn't preprocess or even re-open the file a
second time. This is referred to as the "multiple include
optimization".

   Under what circumstances is such an optimization valid? If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process. Therefore the current implementation imposes
requirements and makes some allowances as follows:

  1. There must be no tokens outside the controlling '#if'-'#endif'
     pair, but whitespace and comments are permitted.

  2. There must be no directives outside the controlling directive pair,
     but the "null directive" (a line containing nothing other than a
     single '#' and possibly whitespace) is permitted.

  3. The opening directive must be of the form

          #ifndef FOO

     or

          #if !defined FOO     [equivalently, #if !defined(FOO)]

  4. In the second form above, the tokens forming the '#if' expression
     must have come directly from the source file--no macro expansion
     must have been involved. This is because macro definitions can
     change, and tracking whether or not a relevant change has been made
     is not worth the implementation cost.

  5. There can be no '#else' or '#elif' directives at the outer
     conditional block level, because they would probably contain
     something of interest to a subsequent pass.

   First, when pushing a new file on the buffer stack,
'_stack_include_file' sets the controlling macro 'mi_cmacro' to 'NULL',
and sets 'mi_valid' to 'true'. This indicates that the preprocessor has
not yet encountered anything that would invalidate the multiple-include
optimization. As described in the next few paragraphs, these two
variables having these values effectively indicates top-of-file.

   When about to return a token that is not part of a directive,
'_cpp_lex_token' sets 'mi_valid' to 'false'. This enforces the
constraint that tokens outside the controlling conditional block
invalidate the optimization.
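
   The requirements listed above can be pictured as a small state
machine run over a simplified stream of events for one file. The
sketch below is an assumption-laden illustration: the event kinds and
'find_guard' are hypothetical, requirement 4 is assumed to have been
checked before an 'EV_IFNDEF' event is produced, and only the names
'mi_valid' and 'mi_cmacro' are taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    EV_TOKEN,           /* a token that is not part of a directive */
    EV_IFNDEF,          /* #ifndef NAME, or an unexpanded #if !defined NAME */
    EV_IF,              /* any other opening conditional */
    EV_ELSE,            /* #else or #elif */
    EV_ENDIF,
    EV_NULL_DIRECTIVE,  /* a lone '#' */
    EV_DIRECTIVE        /* any other directive, e.g. #define */
} event_kind;

typedef struct event {
    event_kind kind;
    const char *macro;  /* guard candidate for EV_IFNDEF, else NULL */
} event;

#define MAX_DEPTH 32

/* Return the guard macro of the file described by EV[0..N), or NULL if
   the file does not qualify for the optimization.  */
static const char *find_guard(const event *ev, size_t n)
{
    const char *stack[MAX_DEPTH];   /* controlling macro of each open block */
    size_t depth = 0;
    bool mi_valid = true;           /* together with a NULL mi_cmacro: top-of-file */
    const char *mi_cmacro = NULL;

    for (size_t i = 0; i < n; i++) {
        switch (ev[i].kind) {
        case EV_TOKEN:
        case EV_DIRECTIVE:
            mi_valid = false;
            break;
        case EV_NULL_DIRECTIVE:
            break;                  /* permitted anywhere */
        case EV_IFNDEF:
        case EV_IF:
            if (depth == MAX_DEPTH)
                return NULL;
            /* Record a controlling macro only at top-of-file. */
            stack[depth++] = (ev[i].kind == EV_IFNDEF && mi_valid
                              && mi_cmacro == NULL) ? ev[i].macro : NULL;
            break;
        case EV_ELSE:
            mi_valid = false;
            if (depth > 0)
                stack[depth - 1] = NULL;    /* block can no longer be a guard */
            break;
        case EV_ENDIF:
            mi_valid = false;
            if (depth == 0)
                return NULL;        /* unbalanced #endif */
            depth--;
            /* Only the outermost block can be the controlling one. */
            if (depth == 0 && stack[0] != NULL) {
                mi_valid = true;
                mi_cmacro = stack[0];
            }
            break;
        }
    }
    return mi_valid ? mi_cmacro : NULL;
}
```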

   The 'do_if', when appropriate, and 'do_ifndef' directive handlers
pass the controlling macro to the function 'push_conditional'. cpplib
maintains a stack of nested conditional blocks, and after processing
every opening conditional this function pushes an 'if_stack' structure
onto the stack. In this structure it records the controlling macro for
the block, provided there is one and we're at top-of-file (as described
above). If an '#elif' or '#else' directive is encountered, the
controlling macro for that block is cleared to 'NULL'. Otherwise, it
survives until the '#endif' closing the block, upon which 'do_endif'
sets 'mi_valid' to true and stores the controlling macro in 'mi_cmacro'.

   '_cpp_handle_directive' clears 'mi_valid' when processing any
directive other than an opening conditional and the null directive.
With this, and requiring top-of-file to record a controlling macro, and
no '#else' or '#elif' for it to survive and be copied to 'mi_cmacro' by
'do_endif', we have enforced the absence of directives outside the main
conditional block for the optimization to be on.

   Note that whilst we are inside the conditional block, 'mi_valid' is
likely to be reset to 'false', but this does not matter since the
closing '#endif' restores it to 'true' if appropriate.

   Finally, since '_cpp_lex_direct' pops the file off the buffer stack
at 'EOF' without returning a token, if the '#endif' directive was not
followed by any tokens, 'mi_valid' is 'true' and '_cpp_pop_file_buffer'
remembers the controlling macro associated with the file.
Subsequent calls to 'stack_include_file' result in no buffer being
pushed if the controlling macro is defined, effecting the optimization.

   A quick word on how we handle the

     #if !defined FOO

case. '_cpp_parse_expr' and 'parse_defined' take steps to see whether
the three stages '!', 'defined-expression' and 'end-of-directive' occur
in order in a '#if' expression. If so, they return the guard macro to
'do_if' in the variable 'mi_ind_cmacro', and otherwise set it to 'NULL'.
'enter_macro_context' sets 'mi_valid' to false, so if a macro was
expanded whilst parsing any part of the expression, then the top-of-file
test in 'push_conditional' fails and the optimization is turned off.


File: cppinternals.info, Node: Files, Next: Concept Index, Prev: Guard Macros, Up: Top

File Handling
*************

Fairly obviously, the file handling code of cpplib resides in the file
'files.cc'. It takes care of the details of file searching, opening,
reading and caching, for both the main source file and all the headers
it recursively includes.

   The basic strategy is to minimize the number of system calls. On
many systems, the basic 'open ()' and 'fstat ()' system calls can be
quite expensive. For every '#include'-d file, we need to try all the
directories in the search path until we find a match. Some projects,
such as glibc, pass twenty or thirty include paths on the command line,
so this can rapidly become time consuming.
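
   To make the cost concrete, the uncached search can be sketched as a
loop that probes one candidate path per directory. Everything in this
sketch is hypothetical: 'find_in_search_path' is not a cpplib function,
and the 'exists' callback merely stands in for the 'open ()' or
'fstat ()' call so that the probes can be counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for an expensive filesystem query such as open () or fstat (). */
typedef int (*exists_fn)(const char *path, void *state);

/* A fake filesystem holding a single file, counting how often it is asked. */
typedef struct fake_fs {
    const char *only_file;
    unsigned probes;
} fake_fs;

static int fake_exists(const char *path, void *state)
{
    fake_fs *fs = state;
    fs->probes++;
    return strcmp(path, fs->only_file) == 0;
}

/* Try DIR/NAME for each directory in order.  On success store the path
   in OUT and return the directory's index; otherwise return -1.  */
static int find_in_search_path(const char *const *dirs, size_t ndirs,
                               const char *name, exists_fn exists,
                               void *state, char *out, size_t cap)
{
    for (size_t i = 0; i < ndirs; i++) {
        int len = snprintf(out, cap, "%s/%s", dirs[i], name);
        if (len < 0 || (size_t)len >= cap)
            continue;               /* path does not fit in OUT: skip it */
        if (exists(out, state))
            return (int)i;          /* one probe per directory tried */
    }
    if (cap > 0)
        out[0] = '\0';
    return -1;
}
```

   With three directories and the header in the last one, a single
lookup costs three probes, and a missing header costs a probe for every
directory.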

   For a header file we have not encountered before we have little
choice but to do this. However, it is often the case that the same
headers are repeatedly included, and in these cases we try to avoid
repeating the filesystem queries whilst searching for the correct file.

   For each file we try to open, we store the constructed path in a
splay tree. This path first undergoes simplification by the function
'_cpp_simplify_pathname'. For example, '/usr/include/bits/../foo.h' is
simplified to '/usr/include/foo.h' before we enter it in the splay tree
and try to 'open ()' the file. CPP will then find subsequent uses of
'foo.h', even as '/usr/include/foo.h', in the splay tree and save system
calls.

   Further, it is likely the file contents have also been cached, saving
a 'read ()' system call. We don't bother caching the contents of header
files that are re-inclusion protected, and whose re-inclusion macro is
defined when we leave the header file for the first time. If the host
supports it, we try to map suitably large files into memory, rather than
reading them in directly.

   The include paths are internally stored on a null-terminated
singly-linked list, starting with the '"header.h"' directory search
chain, which then links into the '<header.h>' directory chain.

   Files included with the '<foo.h>' syntax start the lookup directly in
the second half of this chain. However, files included with the
'"foo.h"' syntax start at the beginning of the chain, but with one extra
directory prepended.
This is the directory of the current file; the one
containing the '#include' directive. Prepending this directory on a
per-file basis is handled by the function 'search_from'.

   Note that a header included with a directory component, such as
'#include "mydir/foo.h"' and opened as '/usr/local/include/mydir/foo.h',
will have the complete path minus the basename 'foo.h' as the current
directory.

   Enough information is stored in the splay tree that CPP can
immediately tell whether it can skip the header file because of the
multiple include optimization, whether the file didn't exist or couldn't
be opened for some reason, or whether the header was flagged not to be
re-used, as it is with the obsolete '#import' directive.

   For the benefit of MS-DOS filesystems with an 8.3 filename
limitation, CPP offers the ability to treat various include file names
as aliases for the real header files with shorter names. The map from
one to the other is found in a special file called 'header.gcc', stored
in the command line (or system) include directories to which the mapping
applies. This may be higher up the directory tree than the full path to
the file minus the base name.


File: cppinternals.info, Node: Concept Index, Prev: Files, Up: Top

Concept Index
*************

[index]
* Menu:

* assertions:                            Hash Nodes.          (line 6)
* controlling macros:                    Guard Macros.        (line 6)
* escaped newlines:                      Lexer.               (line 5)
* files:                                 Files.               (line 6)
* guard macros:                          Guard Macros.        (line 6)
* hash table:                            Hash Nodes.          (line 6)
* header files:                          Conventions.         (line 6)
* identifiers:                           Hash Nodes.          (line 6)
* interface:                             Conventions.         (line 6)
* lexer:                                 Lexer.               (line 6)
* line numbers:                          Line Numbering.      (line 5)
* macro expansion:                       Macro Expansion.     (line 6)
* macro representation (internal):       Macro Expansion.     (line 19)
* macros:                                Hash Nodes.          (line 6)
* multiple-include optimization:         Guard Macros.        (line 6)
* named operators:                       Hash Nodes.          (line 6)
* newlines:                              Lexer.               (line 6)
* paste avoidance:                       Token Spacing.       (line 6)
* spacing:                               Token Spacing.       (line 6)
* token run:                             Lexer.               (line 191)
* token spacing:                         Token Spacing.       (line 6)



Tag Table:
Node: Top905
Node: Conventions2743
Node: Lexer3685
Ref: Invalid identifiers11600
Ref: Lexing a line13550
Node: Hash Nodes18319
Node: Macro Expansion21198
Node: Token Spacing30142
Node: Line Numbering35998
Node: Guard Macros40084
Node: Files44875
Node: Concept Index48342

End Tag Table


Local Variables:
coding: utf-8
End: