This is cppinternals.info, produced by makeinfo version 6.5 from
cppinternals.texi.

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* Cpplib: (cppinternals).      Cpplib internals.
END-INFO-DIR-ENTRY

This file documents the internals of the GNU C Preprocessor.

   Copyright (C) 2000-2020 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the entire resulting derived work is distributed under the terms of
a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions.


File: cppinternals.info, Node: Top, Next: Conventions, Up: (dir)

The GNU C Preprocessor Internals
********************************

* Menu:

* Conventions::
* Lexer::
* Hash Nodes::
* Macro Expansion::
* Token Spacing::
* Line Numbering::
* Guard Macros::
* Files::
* Concept Index::

1 Cpplib--the GNU C Preprocessor
********************************

The GNU C preprocessor is implemented as a library, "cpplib", so it can
be easily shared between a stand-alone preprocessor, and a preprocessor
integrated with the C, C++ and Objective-C front ends. It is also
available for use by other programs, though this is not recommended as
its exposed interface has not yet reached a point of reasonable
stability.

   The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

   This brief manual documents the internals of cpplib, and explains
some of the tricky issues. It is intended that, along with the comments
in the source code, a reasonably competent C programmer should be able
to figure out what the code is doing, and why things have been
implemented the way they have.

* Menu:

* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.


File: cppinternals.info, Node: Conventions, Next: Lexer, Prev: Top, Up: Top

Conventions
***********

cpplib has two interfaces--one is exposed internally only, and the other
is for both internal and external use.

   The convention is that functions and types that are exposed to
multiple files internally are prefixed with '_cpp_', and are to be found
in the file 'internal.h'. Functions and types exposed to external
clients are in 'cpplib.h', and prefixed with 'cpp_'. For historical
reasons this is no longer quite true, but we should strive to stick to
it.

   We are striving to reduce the information exposed in 'cpplib.h' to
the bare minimum necessary, and then to keep it there. This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.


File: cppinternals.info, Node: Lexer, Next: Hash Nodes, Prev: Conventions, Up: Top

The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.c'. It is a hand-coded lexer,
and not implemented as a state machine. It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.

   It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
'cpp_spell_token' and 'cpp_token_len'. These functions are useful when
generating diagnostics, and for emitting the preprocessed output.

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines. In its current form the code is quite complicated, with
read ahead characters and such-like, since it strives to not step back
in the character stream in preparation for handling non-ASCII file
encodings.
The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

   The job of '_cpp_lex_direct' is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor role to play in memory management
of lexed lines. I discuss these issues in a separate section (*note
Lexing a line::).

   The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

   The lexer does not consider whitespace to be a token in its own
right. If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags. Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token. This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.

   The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line.
This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it. Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set. The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.

   New lines are treated specially; exactly how the lexer handles them
is context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive. In
a directive a 'CPP_EOF' token never means end-of-file. Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.

   The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringized. The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call.
It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2. It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

   This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.

   The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator. This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.

   We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline. It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars. Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.

   Handling newlines like this is made simpler by doing it in one place
only.
The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminates a comment. Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.

   So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline. If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.

   Similarly code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

   Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic. However, if they appear in a macro
expansion we don't want to complain with each use of the macro. It is
therefore best to catch them during the lexing stage, in
'parse_identifier'. In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state. For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro. Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate. Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.

   Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, it should be lexed as a single
token as far as the nearest '>' character. Note that we don't allow the
terminators of header names to be escaped; the first '"' or '>'
terminates the header name.
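
   As a rough illustration of the header-name rule above, the sketch
below lexes a '<' either as a one-character token or as a header name,
depending on a flag supplied by the caller. Every name in it
('hdr_kind', 'hdr_token', 'hdr_lex_less' and the 'in_include' parameter)
is hypothetical and merely stands in for whatever state the real lexer
consults; multi-character operators such as '<=' are deliberately
ignored. It is not cpplib's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not cpplib's enum.  */
enum hdr_kind { HDR_LESS, HDR_HEADER_NAME, HDR_ERROR };

struct hdr_token {
    enum hdr_kind kind;
    size_t len;              /* number of characters consumed */
};

/* Lex the token starting at p, which must point at a '<'.
   in_include is an assumed stand-in for the lexer state flag that is
   set only while the operand of #include is being lexed.  */
static struct hdr_token
hdr_lex_less(const char *p, int in_include)
{
    struct hdr_token t = { HDR_LESS, 1 };

    assert(p != NULL && *p == '<');
    if (!in_include)
        return t;            /* ordinary '<' token */

    /* Header name: runs to the first '>'.  A backslash does not
       escape the terminator, so no escape handling is needed.  */
    const char *end = strchr(p + 1, '>');
    if (end == NULL || memchr(p, '\n', (size_t)(end - p)) != NULL) {
        t.kind = HDR_ERROR;  /* no '>' before the end of the line */
        return t;
    }
    t.kind = HDR_HEADER_NAME;
    t.len = (size_t)(end - p) + 1;
    return t;
}
```

   With the flag clear, 'hdr_lex_less("<stdio.h>", 0)' consumes one
character; with it set, the same input is consumed as a single
nine-character header name.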

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error. Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.

   Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'. The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

   Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream. The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid. The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

   The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do _not_ lex a line of tokens at
once; if we did that 'parse_identifier' would not have state flags
available to warn about invalid identifiers (*note Invalid
identifiers::).

   In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.
In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

   Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

   Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run. Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.

   The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'. So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like '__LINE__', and the '#' and '##' operators for stringizing
and token pasting. I handled this by allocating space for these tokens
from the lexer's token run chain. This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

   Lexing into a line of tokens solves some of the token memory
management issues, but not all.
The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line. So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro. In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

   The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made. '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.


File: cppinternals.info, Node: Hash Nodes, Next: Macro Expansion, Prev: Lexer, Up: Top

Hash Nodes
**********

When cpplib encounters an "identifier", it generates a hash code for it
and stores it in the hash table.
By "identifier" we mean tokens with
type 'CPP_NAME'; this includes identifiers in the usual C sense, as well
as keywords, directive names, macro names and so on. For example, all
of 'pragma', 'int', 'foo' and '__GNUC__' are identifiers and hashed when
lexed.

   Each node in the hash table contains various information about the
identifier it represents, for example its length and type. At any one
time, each identifier falls into exactly one of three categories:

   * Macros

     These have been declared to be macros, either on the command line
     or with '#define'. A few, such as '__TIME__', are built-ins entered
     in the hash table during initialization. The hash node for a
     normal macro points to a structure with more information about the
     macro, such as whether it is function-like, how many arguments it
     takes, and its expansion. Built-in macros are flagged as special,
     and instead contain an enum indicating which of the various
     built-in macros it is.

   * Assertions

     Assertions are in a separate namespace to macros. To enforce this,
     cpp actually prepends a '#' character before hashing and entering
     it in the hash table. An assertion's node points to a chain of
     answers to that assertion.

   * Void

     Everything else falls into this category--an identifier that is not
     currently a macro, or a macro that has since been undefined with
     '#undef'.

     When preprocessing C++, this category also includes the named
     operators, such as 'xor'.
     In expressions these behave like the
     operators they represent, but in contexts where the spelling of a
     token matters they are spelt differently. This spelling
     distinction is relevant when they are operands of the stringizing
     and pasting macro operators '#' and '##'. Named operator hash
     nodes are flagged, both to catch the spelling distinction and to
     prevent them from being defined as macros.

   The same identifiers share the same hash node. Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information. For example, when
parsing a '#define' statement, CPP flags each argument's identifier hash
node with the index of that argument. This makes duplicated argument
checking an O(1) operation for each argument. Similarly, for each
identifier in the macro's expansion, lookup to see if it is an argument,
and which argument it is, is also an O(1) operation. Further, each
directive name, such as 'endif', has an associated directive enum stored
in its hash node, so that directive lookup is also O(1).


File: cppinternals.info, Node: Macro Expansion, Next: Token Spacing, Prev: Hash Nodes, Up: Top

Macro Expansion Algorithm
*************************

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to optimize
the preprocessor's expansion algorithm wrong in quite subtle ways.

   I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this section,
let alone the code! If you don't have a clear mental picture of how
things like nested macro expansion, stringizing and token pasting are
supposed to work, damage to your sanity can quickly result.

Internal representation of macros
=================================

The preprocessor stores macro expansions in tokenized form. This saves
repeated lexing passes during expansion, at the cost of a small increase
in memory consumption on average. The tokens are stored contiguously in
memory, so a pointer to the first one and a token count is all you need
to get the replacement list of a macro.

   If the macro is a function-like macro the preprocessor also stores
its parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier. Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type 'CPP_MACRO_ARG'. Each such token holds the index
of the parameter it represents in the parameter list, which allows rapid
replacement of parameters with their arguments during expansion.
Despite this optimization it is still necessary to store the original
parameters to the macro, both for dumping with e.g., '-dD', and to warn
about non-trivial macro redefinitions when the parameter names have
changed.
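
   The representation just described can be pictured with the following
minimal sketch. All of the 'mac_' types and functions are hypothetical
simplifications invented for illustration (the real structures hold hash
node pointers and much more), and 'MAC_ARG' merely plays the part of
'CPP_MACRO_ARG'. The sketch ignores argument pre-expansion and the '#'
and '##' operators; it only shows why an index stored in the token makes
parameter replacement a direct array lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; MAC_ARG stands in for CPP_MACRO_ARG.  */
enum mac_type { MAC_NAME, MAC_NUMBER, MAC_PLUS, MAC_ARG };

struct mac_token {
    enum mac_type type;
    const char *spelling;    /* NULL for MAC_ARG */
    unsigned arg_index;      /* meaningful only for MAC_ARG */
};

struct mac_macro {
    const char *const *params;    /* ordered names, kept for dumping
                                     and redefinition checks */
    unsigned paramc;
    const struct mac_token *exp;  /* first token of the contiguous
                                     replacement list */
    unsigned count;               /* number of tokens in it */
};

struct mac_arg {
    const struct mac_token *first;
    unsigned count;
};

/* Copy the replacement list of m into out, replacing each MAC_ARG
   token with the tokens of the argument it indexes.  Returns the
   number of tokens written.  */
static unsigned
mac_substitute(const struct mac_macro *m, const struct mac_arg *args,
               struct mac_token *out, unsigned out_cap)
{
    unsigned n = 0;

    for (unsigned i = 0; i < m->count; i++) {
        const struct mac_token *t = &m->exp[i];
        if (t->type == MAC_ARG) {
            assert(t->arg_index < m->paramc);
            const struct mac_arg *a = &args[t->arg_index];
            for (unsigned j = 0; j < a->count; j++) {
                assert(n < out_cap);
                out[n++] = a->first[j];
            }
        } else {
            assert(n < out_cap);
            out[n++] = *t;
        }
    }
    return n;
}

/* "#define ADD(a, b) a + b", stored as <arg 0> + <arg 1>.  */
static const char *const add_params[] = { "a", "b" };
static const struct mac_token add_exp[] = {
    { MAC_ARG, NULL, 0 }, { MAC_PLUS, "+", 0 }, { MAC_ARG, NULL, 1 }
};
static const struct mac_macro add_macro = { add_params, 2, add_exp, 3 };

/* Substitute the arguments of ADD(1, x) into out.  */
static unsigned
mac_demo(struct mac_token *out, unsigned out_cap)
{
    static const struct mac_token one = { MAC_NUMBER, "1", 0 };
    static const struct mac_token x = { MAC_NAME, "x", 0 };
    const struct mac_arg args[] = { { &one, 1 }, { &x, 1 } };

    return mac_substitute(&add_macro, args, out, out_cap);
}
```

   Calling 'mac_demo' with room for at least three tokens yields the
tokens '1', '+' and 'x'. Note that the parameter names are never
consulted during substitution, which is the point of storing the index.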

Macro expansion overview
========================

The preprocessor maintains a "context stack", implemented as a linked
list of 'cpp_context' structures, which together represent the macro
expansion state at any one time. The 'struct cpp_reader' member
variable 'context' points to the current top of this stack. The top
normally holds the unexpanded replacement list of the innermost macro
under expansion, except when cpplib is about to pre-expand an argument,
in which case it holds that argument's unexpanded tokens.

   When there are no macros under expansion, cpplib is in "base
context". All contexts other than the base context contain a contiguous
list of tokens delimited by a starting and ending token. When not in
base context, cpplib obtains the next token from the list of the top
context. If there are no tokens left in the list, it pops that context
off the stack, and subsequent ones if necessary, until an unexhausted
context is found or it returns to base context. In base context, cpplib
reads tokens directly from the lexer.

   If it encounters an identifier that is both a macro and enabled for
expansion, cpplib prepares to push a new context for that macro on the
stack by calling the routine 'enter_macro_context'. When this routine
returns, the new context will contain the unexpanded tokens of the
replacement list of that macro. In the case of function-like macros,
'enter_macro_context' also replaces any parameters in the replacement
list, stored as 'CPP_MACRO_ARG' tokens, with the appropriate macro
argument.
If the standard requires that the parameter be replaced with
its expanded argument, the argument will have been fully macro expanded
first.

   'enter_macro_context' also handles special macros like '__LINE__'.
Although these macros expand to a single token which cannot contain any
further macros, for reasons of token spacing (*note Token Spacing::) and
simplicity of implementation, cpplib handles these special macros by
pushing a context containing just that one token.

   The final thing that 'enter_macro_context' does before returning is
to mark the macro disabled for expansion (except for special macros like
'__TIME__'). The macro is re-enabled when its context is later popped
from the context stack, as described above. This strict ordering
ensures that a macro is disabled whilst its expansion is being scanned,
but that it is _not_ disabled whilst any arguments to it are being
expanded.

Scanning the replacement list for macros to expand
==================================================

The C standard states that, after any parameters have been replaced with
their possibly-expanded arguments, the replacement list is scanned for
nested macros. Further, any identifiers in the replacement list that
are not expanded during this scan are never again eligible for expansion
in the future, if the reason they were not expanded is that the macro in
question was disabled.

   Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion.
Other tokens never have an opportunity to be
re-tested for expansion. It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan. This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token happens
to be an opening parenthesis (itself possibly the first token of an
argument).

   It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack. Only when
looking for the _next_ token do we pop it off the stack and drop to a
lower context. This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current context
is still disabled when we are considering the last token of its
replacement list for expansion (or indeed expanding it). As an example,
which illustrates many of the points above, consider

     #define foo(x) bar x
     foo(foo) (2)

which fully expands to 'bar foo (2)'. During pre-expansion of the
argument, 'foo' does not expand even though the macro is enabled, since
it has no following parenthesis [pre-expansion of an argument only uses
tokens from that argument; it cannot take tokens from whatever follows
the macro invocation]. This still leaves the argument token 'foo'
eligible for future expansion.
Then, when re-scanning after argument
replacement, the token 'foo' is rejected for expansion, and marked
ineligible for future expansion, since the macro is now disabled. It is
disabled because the replacement list 'bar foo' of the macro is still on
the context stack.

   If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro were disabled, it would be subtly wrong.
In the example above, the replacement list of 'foo' would be popped in
the process of finding the parenthesis, re-enabling 'foo' and expanding
it a second time.

Looking for a function-like macro's opening parenthesis
=======================================================

Function-like macros only expand when immediately followed by a
parenthesis. To check for one, cpplib needs to temporarily disable
macros and read the next token. Unfortunately, because of spacing issues
(*note Token Spacing::), there can be fake padding tokens in between, and
if the next real token is not a parenthesis cpplib needs to be able to
back up that one token as well as retain the information in any
intervening padding tokens.

   Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which is too hard.
Instead, searching for the parenthesis is handled by a special function,
'funlike_invocation_p', which remembers padding information as it reads
tokens.
If the next real token is not an opening parenthesis, it backs
up that one token, and then pushes an extra context just containing the
padding information if necessary.

Marking tokens ineligible for future expansion
==============================================

As discussed above, cpplib needs a way of marking tokens as
unexpandable. Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the flag
'NO_EXPAND' to the copy.

   For efficiency and to simplify memory management by avoiding having
to remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (*note Lexing a line::) using the
function '_cpp_temp_token'. The tokens are then re-used once the
current line of tokens has been read in.

   This might sound unsafe. However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization is
safe.


File: cppinternals.info, Node: Token Spacing, Next: Line Numbering, Prev: Macro Expansion, Up: Top

Token Spacing
*************

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its
preprocessed output results in an identical token stream.
Without
taking special measures, this might not be the case because of macro
substitution. For example:

     #define PLUS +
     #define EMPTY
     #define f(x) =x=
     +PLUS -EMPTY- PLUS+ f(=)
          ==> + + - - + + = = =
     _not_
          ==> ++ -- ++ ===

   One solution would be to simply insert a space between all adjacent
tokens. However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

   For now, just notice that when tokens are added to (or removed from,
as shown by the 'EMPTY' example) the original lexed token stream, we
need to check for accidental token pasting. We call this "paste
avoidance". Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the '#' and '##' operators.

   First, look at how the preprocessor normally gets whitespace output
correct. The 'cpp_token' structure contains a flags byte, and one of
those flags is 'PREV_WHITE'. This is flagged by the lexer, and indicates
that the token was preceded by whitespace of some form other than a new
line. The stand-alone preprocessor can use this flag to decide whether
to insert a space between tokens in the output.
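
   As a rough illustration of that last point, the following sketch
prints a token list and emits a space only before tokens whose
'PREV_WHITE' flag is set. The 'struct tok' type, the flag value and the
function 'render_tokens' are hypothetical stand-ins invented for this
example; they are not cpplib's own definitions, and padding tokens and
paste avoidance are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for cpplib's token and flag; only the idea of
   a per-token "preceded by whitespace" bit is taken from the text.  */
#define PREV_WHITE 0x01

struct tok
{
  const char *spelling;
  unsigned char flags;
};

/* Write the tokens to OUT, emitting one space before each token whose
   PREV_WHITE flag is set.  Returns the number of characters written,
   excluding the terminating NUL; output is truncated to fit CAP.  */
static size_t
render_tokens (const struct tok *toks, size_t n, char *out, size_t cap)
{
  size_t len = 0;

  assert (cap > 0);
  for (size_t i = 0; i < n; i++)
    {
      size_t need = strlen (toks[i].spelling);

      if ((toks[i].flags & PREV_WHITE) && len + 1 < cap)
        out[len++] = ' ';
      if (len + need >= cap)
        break;
      memcpy (out + len, toks[i].spelling, need);
      len += need;
    }
  out[len] = '\0';
  return len;
}
```

   With this scheme the printer never inspects the original source
text; the one bit recorded by the lexer is enough to reproduce
single-space separation for tokens that come straight from the lexer.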

   Now consider the result of the following macro expansion:

     #define add(x, y, z) x + y +z;
     sum = add (1,2, 3);
          ==> sum = 1 + 2 +3;

   The interesting thing here is that the tokens '1' and '2' are output
with a preceding space, and '3' is output without a preceding space, but
when lexed none of these tokens had that property. Careful
consideration reveals that '1' gets its preceding whitespace from the
space preceding 'add' in the macro invocation, _not_ from the
replacement list. '2' gets its whitespace from the space preceding the
parameter 'y' in the macro replacement list, and '3' has no preceding
space because parameter 'z' has none in the replacement list.

   Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions. So instead of modifying the two tokens
above, the preprocessor inserts a special token, which I call a "padding
token", into the token stream to indicate that spacing of the subsequent
token is special. The preprocessor inserts padding tokens in front of
every macro expansion and expanded macro argument. These point to a
"source token" from which the subsequent real token should inherit its
spacing. In the above example, the source tokens are 'add' in the macro
invocation, and 'y' and 'z' in the macro replacement list, respectively.

   It is quite easy to get multiple padding tokens in a row, for example
if a macro's first replacement token expands straight into another
macro.

     #define foo bar
     #define bar baz
     [foo]
          ==> [baz]

   Here, two padding tokens are generated, whose sources are the 'foo'
token between the brackets and the 'bar' token from foo's replacement
list, respectively. Clearly the first padding token is the one to use,
so the output code should contain a rule that the first padding token in
a sequence is the one that matters.

   But what happens when we leave a macro expansion? Adjusting the
above example slightly:

     #define foo bar
     #define bar EMPTY baz
     #define EMPTY
     [foo] EMPTY;
          ==> [ baz] ;

   As shown, now there should be a space before 'baz' and the semicolon
in the output.

   The rules we decided above fail for 'baz': we generate three padding
tokens, one per macro invocation, before the token 'baz'. We would then
have it take its spacing from the first of these, which carries source
token 'foo' with no leading space.

   It is vital that cpplib get spacing correct in these examples since
any of these macro expansions could be stringized, where spacing
matters.

   So, this demonstrates that not just entering macro and argument
expansions, but also leaving them, requires special handling. I made
cpplib insert a padding token with a 'NULL' source token when leaving
macro expansions, as well as after each replaced argument in a macro's
replacement list. It also inserts appropriate padding tokens on either
side of tokens created by the '#' and '##' operators.
I expanded the
rule so that, if we see a padding token with a 'NULL' source token,
_and_ the source token recorded so far has no leading space, then we
behave as if we have seen no padding tokens at all. A quick check shows
this rule will then get the above example correct as well.

   Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct. This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function 'cpp_avoid_paste' advises whether a space is required
between two consecutive tokens. To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.


File: cppinternals.info, Node: Line Numbering, Next: Guard Macros, Prev: Token Spacing, Up: Top

Line numbering
**************

Just which line number anyway?
==============================

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

   * The source line it was lexed on.
   * The line it is output on.
     This can be different to the line it was
     lexed on if, for example, there are intervening escaped newlines or
     C-style comments. For example:

          foo /* A long
          comment */ bar \
          baz
          =>
          foo bar baz

   * If the token results from a macro expansion, the line of the macro
     name, or possibly the line of the closing parenthesis in the case
     of function-like macro expansion.

   The 'cpp_token' structure contains 'line' and 'col' members. The
lexer fills these in with the line and column of the first character of
the token. Consequently, but maybe unexpectedly, a token from the
replacement list of a macro expansion carries the location of the token
within the '#define' directive, because cpplib expands a macro by
returning pointers to the tokens in its replacement list. The current
implementation of cpplib assigns tokens created from built-in macros and
the '#' and '##' operators the location of the most recently lexed
token. This is because they are allocated from the lexer's token
runs, and because of the way the diagnostic routines infer the
appropriate location to report.

   The diagnostic routines in cpplib display the location of the most
recently _lexed_ token, unless they are passed a specific line and
column to report. For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.

   The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion. All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

   To solve these issues, cpplib provides a callback that is invoked
whenever it lexes a preprocessing token that starts a new logical line
other than a directive. It passes this token (which may be a 'CPP_EOF'
token indicating the end of the translation unit) to the callback
routine, which can then use the line and column of this token to produce
correct output.

Representation of line numbers
==============================

As mentioned above, cpplib stores with each token the line number that
it was lexed on. In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

   The preprocessor maintains a monotonically increasing line count,
which is incremented at every new line character (and also at the end of
any buffer that does not end in a new line). Since a line number of
zero is useful to indicate certain special states and conditions, this
variable starts counting from one.
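
   The counting rule can be sketched as follows. This is only an
illustration under stated assumptions: the function
'advance_line_count' is hypothetical, not part of cpplib, and the
choice to leave the count unchanged for an empty buffer is made for the
sketch alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advance the translation-unit line count *LINE
   over the buffer BUF of length LEN.  The count goes up once per
   newline character, and once more if a non-empty buffer does not end
   in a newline.  An empty buffer leaves the count unchanged, which is
   an assumption of this sketch.  */
static void
advance_line_count (unsigned int *line, const char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (buf[i] == '\n')
      (*line)++;

  if (len > 0 && buf[len - 1] != '\n')
    (*line)++;
}
```

   A caller would initialize the count to one, keeping zero free to
mark special states, and then feed each buffer through the helper in
the order the buffers are read.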

   This line count therefore uniquely enumerates each line in the
translation unit. With some simple infrastructure, it is
straightforward to map from this to the original source file and line
number pair, saving space whenever line number information needs to be
saved. The code that implements this mapping lies in the files
'line-map.c' and 'line-map.h'.

   Command-line macros and assertions are implemented by pushing a
buffer containing the right hand side of an equivalent '#define' or
'#assert' directive. Some built-in macros are handled similarly. Since
these are all processed before the first line of the main input file, it
will typically have an assigned line closer to twenty than to one.


File: cppinternals.info, Node: Guard Macros, Next: Files, Prev: Line Numbering, Up: Top

The Multiple-Include Optimization
*********************************

Header files are often of the form

     #ifndef FOO
     #define FOO
     ...
     #endif

to prevent the compiler from processing them more than once. The
preprocessor notices such header files, so that if the header file
appears in a subsequent '#include' directive and 'FOO' is defined, then
the file is skipped: the preprocessor does not preprocess or even
re-open it a second time. This is referred to as the "multiple include
optimization".

   Under what circumstances is such an optimization valid?
If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process. Therefore the current implementation imposes
requirements and makes some allowances as follows:

  1. There must be no tokens outside the controlling '#if'-'#endif'
     pair, but whitespace and comments are permitted.

  2. There must be no directives outside the controlling directive pair,
     but the "null directive" (a line containing nothing other than a
     single '#' and possibly whitespace) is permitted.

  3. The opening directive must be of the form

          #ifndef FOO

     or

          #if !defined FOO     [equivalently, #if !defined(FOO)]

  4. In the second form above, the tokens forming the '#if' expression
     must have come directly from the source file--no macro expansion
     must have been involved. This is because macro definitions can
     change, and tracking whether or not a relevant change has been made
     is not worth the implementation cost.

  5. There can be no '#else' or '#elif' directives at the outer
     conditional block level, because they would probably contain
     something of interest to a subsequent pass.

   First, when pushing a new file on the buffer stack,
'_stack_include_file' sets the controlling macro 'mi_cmacro' to 'NULL',
and sets 'mi_valid' to 'true'. This indicates that the preprocessor has
not yet encountered anything that would invalidate the multiple-include
optimization.
As described in the next few paragraphs, these two
variables having these values effectively indicates top-of-file.

   When about to return a token that is not part of a directive,
'_cpp_lex_token' sets 'mi_valid' to 'false'. This enforces the
constraint that tokens outside the controlling conditional block
invalidate the optimization.

   The 'do_if', when appropriate, and 'do_ifndef' directive handlers
pass the controlling macro to the function 'push_conditional'. cpplib
maintains a stack of nested conditional blocks, and after processing
every opening conditional this function pushes an 'if_stack' structure
onto the stack. In this structure it records the controlling macro for
the block, provided there is one and we're at top-of-file (as described
above). If an '#elif' or '#else' directive is encountered, the
controlling macro for that block is cleared to 'NULL'. Otherwise, it
survives until the '#endif' closing the block, upon which 'do_endif'
sets 'mi_valid' to true and stores the controlling macro in 'mi_cmacro'.

   '_cpp_handle_directive' clears 'mi_valid' when processing any
directive other than an opening conditional and the null directive.
With this, and requiring top-of-file to record a controlling macro, and
no '#else' or '#elif' for it to survive and be copied to 'mi_cmacro' by
'do_endif', we have enforced the absence of directives outside the main
conditional block for the optimization to be on.
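
   The bookkeeping described so far can be modelled as a small state
machine. The sketch below is a simplified illustration, not cpplib's
code: the type 'struct mi_state' and every 'mi_*' function are
hypothetical names invented here, the conditional stack is a fixed-size
array, and each function stands for one kind of event seen while
reading a header. Only the field names 'mi_cmacro' and 'mi_valid' are
borrowed from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MI_MAX_DEPTH 16

/* Hypothetical model of the per-file state described above.  */
struct mi_state
{
  const char *mi_cmacro;               /* candidate guard for the file */
  bool mi_valid;                       /* nothing has invalidated it yet */
  const char *if_stack[MI_MAX_DEPTH];  /* controlling macro per open block */
  int depth;
};

/* A new file is pushed: top-of-file state.  */
static void
mi_start_file (struct mi_state *s)
{
  s->mi_cmacro = NULL;
  s->mi_valid = true;
  s->depth = 0;
}

/* A token that is not part of a directive is returned.  */
static void
mi_token (struct mi_state *s)
{
  s->mi_valid = false;
}

/* An opening conditional; CMACRO is NULL when it has no guard macro.
   The macro is recorded only if we are still at top-of-file.  */
static void
mi_open_conditional (struct mi_state *s, const char *cmacro)
{
  bool top_of_file = s->mi_valid && s->mi_cmacro == NULL;

  assert (s->depth < MI_MAX_DEPTH);
  s->if_stack[s->depth++] = top_of_file ? cmacro : NULL;
}

/* Any directive other than an opening conditional or null directive.  */
static void
mi_other_directive (struct mi_state *s)
{
  s->mi_valid = false;
}

/* #else or #elif: the current block loses its controlling macro.  */
static void
mi_else_or_elif (struct mi_state *s)
{
  assert (s->depth > 0);
  mi_other_directive (s);
  s->if_stack[s->depth - 1] = NULL;
}

/* #endif: a surviving controlling macro becomes the file's candidate.  */
static void
mi_endif (struct mi_state *s)
{
  const char *cmacro;

  assert (s->depth > 0);
  mi_other_directive (s);
  cmacro = s->if_stack[--s->depth];
  if (cmacro != NULL)
    {
      s->mi_valid = true;
      s->mi_cmacro = cmacro;
    }
}

/* At end of file the candidate counts only if still valid.  */
static const char *
mi_guard_at_eof (const struct mi_state *s)
{
  return s->mi_valid ? s->mi_cmacro : NULL;
}
```

   Feeding the events of a well-formed guarded header through this
model ('#ifndef FOO', '#define FOO', some tokens, '#endif') leaves
'FOO' as the guard, whereas a stray token before the '#ifndef' or after
the '#endif', or an outer '#else', leaves no guard at all.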

   Note that whilst we are inside the conditional block, 'mi_valid' is
likely to be reset to 'false', but this does not matter since the
closing '#endif' restores it to 'true' if appropriate.

   Finally, since '_cpp_lex_direct' pops the file off the buffer stack
at 'EOF' without returning a token, if the '#endif' directive was not
followed by any tokens, 'mi_valid' is 'true' and '_cpp_pop_file_buffer'
remembers the controlling macro associated with the file. Subsequent
calls to 'stack_include_file' result in no buffer being pushed if the
controlling macro is defined, effecting the optimization.

   A quick word on how we handle the

     #if !defined FOO

case. '_cpp_parse_expr' and 'parse_defined' take steps to see whether
the three stages '!', 'defined-expression' and 'end-of-directive' occur
in order in a '#if' expression. If so, they return the guard macro to
'do_if' in the variable 'mi_ind_cmacro', and otherwise set it to 'NULL'.
'enter_macro_context' sets 'mi_valid' to false, so if a macro was
expanded whilst parsing any part of the expression, then the top-of-file
test in 'push_conditional' fails and the optimization is turned off.


File: cppinternals.info, Node: Files, Next: Concept Index, Prev: Guard Macros, Up: Top

File Handling
*************

Fairly obviously, the file handling code of cpplib resides in the file
'files.c'.
It takes care of the details of file searching, opening,
reading and caching, for both the main source file and all the headers
it recursively includes.

   The basic strategy is to minimize the number of system calls. On
many systems, the basic 'open ()' and 'fstat ()' system calls can be
quite expensive. For every '#include'-d file, we need to try all the
directories in the search path until we find a match. Some projects,
such as glibc, pass twenty or thirty include paths on the command line,
so this can rapidly become time-consuming.

   For a header file we have not encountered before we have little
choice but to do this. However, it is often the case that the same
headers are repeatedly included, and in these cases we try to avoid
repeating the filesystem queries whilst searching for the correct file.

   For each file we try to open, we store the constructed path in a
splay tree. This path first undergoes simplification by the function
'_cpp_simplify_pathname'. For example, '/usr/include/bits/../foo.h' is
simplified to '/usr/include/foo.h' before we enter it in the splay tree
and try to 'open ()' the file. CPP will then find subsequent uses of
'foo.h', even as '/usr/include/foo.h', in the splay tree and save system
calls.

   Further, it is likely the file contents have also been cached, saving
a 'read ()' system call. We don't bother caching the contents of header
files that are re-inclusion protected, and whose re-inclusion macro is
defined when we leave the header file for the first time.
If the host
supports it, we try to map suitably large files into memory, rather than
reading them in directly.

   The include paths are internally stored on a null-terminated
singly-linked list, starting with the '"header.h"' directory search
chain, which then links into the '<header.h>' directory chain.

   Files included with the '<foo.h>' syntax start the lookup directly in
the second half of this chain. However, files included with the
'"foo.h"' syntax start at the beginning of the chain, but with one extra
directory prepended. This is the directory of the current file; the one
containing the '#include' directive. Prepending this directory on a
per-file basis is handled by the function 'search_from'.

   Note that a header included with a directory component, such as
'#include "mydir/foo.h"' and opened as '/usr/local/include/mydir/foo.h',
will have the complete path minus the basename 'foo.h' as the current
directory.

   Enough information is stored in the splay tree that CPP can
immediately tell whether it can skip the header file because of the
multiple include optimization, whether the file didn't exist or couldn't
be opened for some reason, or whether the header was flagged not to be
re-used, as it is with the obsolete '#import' directive.

   For the benefit of MS-DOS filesystems with an 8.3 filename
limitation, CPP offers the ability to treat various include file names
as aliases for the real header files with shorter names.
The map from
one to the other is found in a special file called 'header.gcc', stored
in the command line (or system) include directories to which the mapping
applies. This may be higher up the directory tree than the full path to
the file minus the base name.


File: cppinternals.info, Node: Concept Index, Prev: Files, Up: Top

Concept Index
*************

[index]
* Menu:

* assertions:                            Hash Nodes.       (line   6)
* controlling macros:                    Guard Macros.     (line   6)
* escaped newlines:                      Lexer.            (line   5)
* files:                                 Files.            (line   6)
* guard macros:                          Guard Macros.     (line   6)
* hash table:                            Hash Nodes.       (line   6)
* header files:                          Conventions.      (line   6)
* identifiers:                           Hash Nodes.       (line   6)
* interface:                             Conventions.      (line   6)
* lexer:                                 Lexer.            (line   6)
* line numbers:                          Line Numbering.   (line   5)
* macro expansion:                       Macro Expansion.  (line   6)
* macro representation (internal):       Macro Expansion.  (line  19)
* macros:                                Hash Nodes.       (line   6)
* multiple-include optimization:         Guard Macros.     (line   6)
* named operators:                       Hash Nodes.       (line   6)
* newlines:                              Lexer.            (line   6)
* paste avoidance:                       Token Spacing.    (line   6)
* spacing:                               Token Spacing.    (line   6)
* token run:                             Lexer.            (line 191)
* token spacing:                         Token Spacing.
                                                           (line   6)



Tag Table:
Node: Top905
Node: Conventions2743
Node: Lexer3685
Ref: Invalid identifiers11599
Ref: Lexing a line13549
Node: Hash Nodes18318
Node: Macro Expansion21197
Node: Token Spacing30141
Node: Line Numbering35997
Node: Guard Macros40082
Node: Files44873
Node: Concept Index48339

End Tag Table