\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo
@dircategory Software development
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software
Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@title Cpplib Internals
@versionsubtitle
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001, 2002, 2004, 2005
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@contents
@page

@ifnottex
@node Top
@top
@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is
implemented as a library, @dfn{cpplib}, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends. It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some
of the tricky issues. It is intended that, along with the comments in
the source code, a reasonably competent C programmer should be able to
figure out what the code is doing, and why things have been implemented
the way they have.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.
@end menu
@end ifnottex

@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces---one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @file{internal.h}. Functions and types exposed to external
clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}. For
historical reasons this is no longer quite true, but we should strive to
stick to it.

We are striving to reduce the information exposed in @file{cpplib.h} to the
bare minimum necessary, and then to keep it there.
This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.

@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines

@section Overview
The lexer is contained in the file @file{lex.c}. It is a hand-coded
lexer, and not implemented as a state machine. It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}.
These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.

@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines. In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor r@^ole to play in memory
management of lexed lines. I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @code{cur_token}, and then increments it. This variable is
important for correct diagnostic positioning.
Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@code{line} and @code{col} values of the token just before the location
that @code{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags. Each token has its
@code{line} and @code{col} variables set to the line and column of the
first character of the token. This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line. This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set. The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.
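
The flag and position bookkeeping described above can be sketched with a
toy lexer. This is only an illustration under simplifying assumptions,
not cpplib's code: the names @code{toy_token}, @code{toy_lex},
@code{TOY_PREV_WHITE} and @code{TOY_BOL} are hypothetical, tokens are
plain whitespace-separated words, a tab counts as a single column, and
escaped newlines are ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only; they are not taken
   from cpplib's headers.  */
enum { TOY_PREV_WHITE = 1 << 0, TOY_BOL = 1 << 1 };

struct toy_token {
  const char *start;   /* first character of the spelling */
  size_t len;          /* length of the spelling */
  unsigned line, col;  /* 1-based position of the first character */
  unsigned flags;      /* TOY_PREV_WHITE and/or TOY_BOL */
};

/* Lex whitespace-separated words from TEXT into OUT (at most MAX of
   them) and return the number of tokens stored.  Whitespace never
   becomes a token; it only sets TOY_PREV_WHITE on the next token.  */
size_t toy_lex(const char *text, struct toy_token *out, size_t max)
{
  size_t n = 0;
  unsigned line = 1, col = 1;
  int at_bol = 1, saw_white = 0;
  const char *p = text;

  assert(text != NULL && out != NULL);
  while (*p != '\0' && n < max) {
    if (*p == '\n') {
      /* A new line resets the column and arms the BOL flag; it does
         not count as preceding whitespace.  */
      line++;
      col = 1;
      at_bol = 1;
      saw_white = 0;
      p++;
    } else if (isspace((unsigned char)*p)) {
      saw_white = 1;
      col++;
      p++;
    } else {
      struct toy_token *t = &out[n++];
      t->start = p;
      t->line = line;
      t->col = col;
      t->flags = (saw_white ? TOY_PREV_WHITE : 0) | (at_bol ? TOY_BOL : 0);
      while (*p != '\0' && !isspace((unsigned char)*p)) {
        p++;
        col++;
      }
      t->len = (size_t)(p - t->start);
      at_bol = 0;
      saw_white = 0;
    }
  }
  return n;
}
```

Lexing the three-word input @samp{a  b}, new line, @samp{ c} with this
sketch marks @samp{a} and @samp{c} as beginning-of-line tokens, and
marks @samp{b} and @samp{c} as preceded by whitespace.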

New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a @code{CPP_EOF} token never means
end-of-file. Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringified. The state variable
@code{parsing_args} is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2.
It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like

@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator. This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.

We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only. The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminates a comment. Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline.
If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort. Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.

The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.
For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a @samp{<} would be lexed as a single
token. After a @code{#include} directive, though, it should be lexed as
a single token as far as the nearest @samp{>} character. Note that we
don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error. Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.

Once a token has been lexed, it leads an independent existence.
The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls @code{cpp_destroy}, probably at the end of the
translation unit.

@anchor{Lexing a line}
@section Lexing a line
@cindex token run

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream.
The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid. The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable. In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run. Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.

The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}. So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
for stringification and token pasting. I handled this by allocating
space for these tokens from the lexer's token run chain. This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management
issues, but not all.
The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line. So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero. Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro. In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.

The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made. @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.
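
The lifetime rules of this section can be sketched as a chain of
fixed-size token runs. The sketch below is an assumption-laden
illustration, not cpplib's data structures: @code{toy_run},
@code{toy_lexer}, @code{toy_push} and @code{toy_end_of_line} are
hypothetical names, a token is reduced to a single integer, and the run
size is made tiny so that overflow is easy to trigger. Only
@code{keep_tokens} is a name taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_RUN_SIZE 2   /* deliberately tiny so that runs overflow */

struct toy_token { int id; };

struct toy_run {
  struct toy_token base[TOY_RUN_SIZE];
  struct toy_run *next;       /* run chained on overflow, or NULL */
};

struct toy_lexer {
  struct toy_run *first;      /* head of the chain */
  struct toy_run *cur_run;    /* run currently being filled */
  size_t used;                /* slots already used in cur_run */
  unsigned keep_tokens;       /* nonzero: do not rewind at end of line */
};

static struct toy_run *toy_new_run(void)
{
  struct toy_run *run = malloc(sizeof *run);
  if (run == NULL)
    abort();
  run->next = NULL;
  return run;
}

void toy_lexer_init(struct toy_lexer *lx)
{
  lx->first = lx->cur_run = toy_new_run();
  lx->used = 0;
  lx->keep_tokens = 0;
}

/* Store a token and return a pointer to it.  A full run is never
   reallocated, so pointers handed out earlier stay valid; instead a
   new run is chained on (or an old one is reused).  */
struct toy_token *toy_push(struct toy_lexer *lx, int id)
{
  if (lx->used == TOY_RUN_SIZE) {
    if (lx->cur_run->next == NULL)
      lx->cur_run->next = toy_new_run();
    lx->cur_run = lx->cur_run->next;
    lx->used = 0;
  }
  lx->cur_run->base[lx->used].id = id;
  return &lx->cur_run->base[lx->used++];
}

/* At an unescaped new line, go back to the start of the first run
   unless someone has asked for tokens to be kept.  */
void toy_end_of_line(struct toy_lexer *lx)
{
  if (lx->keep_tokens == 0) {
    lx->cur_run = lx->first;
    lx->used = 0;
  }
}

void toy_lexer_free(struct toy_lexer *lx)
{
  struct toy_run *run = lx->first;
  while (run != NULL) {
    struct toy_run *next = run->next;
    free(run);
    run = next;
  }
  lx->first = lx->cur_run = NULL;
}
```

Incrementing @code{keep_tokens} before a line ends makes the next token
land after the existing ones; once it drops back to zero, the next new
line causes the first slot to be overwritten again.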

@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an ``identifier'', it generates a hash code for
it and stores it in the hash table. By ``identifier'' we mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on. For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.

Each node in the hash table contains various information about the
identifier it represents, for example its length and type. At any one
time, each identifier falls into exactly one of three categories:

@itemize @bullet
@item Macros

These have been declared to be macros, either on the command line or
with @code{#define}. A few, such as @code{__TIME__}, are built-ins
entered in the hash table during initialization. The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion. Built-in macros are flagged as special, and each
instead contains an enum indicating which of the various built-in macros
it is.

@item Assertions

Assertions are in a separate namespace to macros. To enforce this, cpp
actually prepends a @code{#} character to the assertion's name before
hashing it and entering it in the hash table. An assertion's node
points to a chain of answers to that assertion.

@item Void

Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.

When preprocessing C++, this category also includes the named operators,
such as @code{xor}. In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently. This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}. Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize

The same identifiers share the same hash node. Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information. For example, when
parsing a @code{#define} statement, CPP flags each argument's identifier
hash node with the index of that argument.
This makes duplicated
argument checking an O(1) operation for each argument. Similarly, for
each identifier in the macro's expansion, lookup to see if it is an
argument, and which argument it is, is also an O(1) operation. Further,
each directive name, such as @code{endif}, has an associated directive
enum stored in its hash node, so that directive lookup is also O(1).

@node Macro Expansion
@unnumbered Macro Expansion Algorithm
@cindex macro expansion

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to
optimize the preprocessor's expansion algorithm wrong in quite subtle
ways.

I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this
section, let alone the code! If you don't have a clear mental
picture of how things like nested macro expansion, stringification and
token pasting are supposed to work, damage to your sanity can quickly
result.

@section Internal representation of macros
@cindex macro representation (internal)

The preprocessor stores macro expansions in tokenized form. This
saves repeated lexing passes during expansion, at the cost of a small
increase in memory consumption on average.
The tokens are stored
contiguously in memory, so a pointer to the first one and a token
count are all you need to get the replacement list of a macro.

If the macro is a function-like macro the preprocessor also stores its
parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier. Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type @code{CPP_MACRO_ARG}. Each such token holds the
index of the parameter it represents in the parameter list, which
allows rapid replacement of parameters with their arguments during
expansion. Despite this optimization it is still necessary to store
the original parameters to the macro, both for dumping with e.g.,
@option{-dD}, and to warn about non-trivial macro redefinitions when
the parameter names have changed.

@section Macro expansion overview
The preprocessor maintains a @dfn{context stack}, implemented as a
linked list of @code{cpp_context} structures, which together represent
the macro expansion state at any one time. The @code{struct
cpp_reader} member variable @code{context} points to the current top
of this stack. The top normally holds the unexpanded replacement list
of the innermost macro under expansion, except when cpplib is about to
pre-expand an argument, in which case it holds that argument's
unexpanded tokens.
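
As a rough illustration of this arrangement, the context stack can be
modelled as a singly linked list whose top is tracked by the reader.
The sketch below is not cpplib's code: the names @code{struct context},
@code{struct reader}, @code{reader_init}, @code{push_context} and
@code{next_context_token} are hypothetical, and the structures are much
simplified. It also shows exhausted contexts being popped when the next
token is requested.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-simplified stand-ins for cpplib's structures. */
struct token {
    const char *spelling;
};

struct context {
    struct context *prev;       /* next context down the stack; NULL for base */
    const struct token *first;  /* next unread token in this context */
    const struct token *last;   /* one past the final token */
};

struct reader {
    struct context base;        /* base context: tokens come from the lexer */
    struct context *context;    /* current top of the context stack */
};

/* Start with only the base context on the stack. */
void reader_init(struct reader *r)
{
    r->base.prev = NULL;
    r->base.first = NULL;
    r->base.last = NULL;
    r->context = &r->base;
}

/* Push context C, holding COUNT contiguous tokens starting at TOKS. */
void push_context(struct reader *r, struct context *c,
                  const struct token *toks, size_t count)
{
    assert(toks != NULL || count == 0);
    c->prev = r->context;
    c->first = toks;
    c->last = toks + count;
    r->context = c;
}

/* Return the next token from the context stack, popping exhausted
   contexts on the way.  Returns NULL once the base context is reached;
   a real implementation would read from the lexer at that point. */
const struct token *next_context_token(struct reader *r)
{
    while (r->context->prev != NULL) {
        struct context *c = r->context;
        if (c->first < c->last)
            return c->first++;
        r->context = c->prev;   /* exhausted: pop and try the one below */
    }
    return NULL;
}
```

In this sketch a context is only popped when a further token is asked
for, so the context that supplied the most recently returned token is
still on the stack while that token is examined.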

When there are no macros under expansion, cpplib is in @dfn{base
context}. All contexts other than the base context contain a
contiguous list of tokens delimited by a starting and ending token.
When not in base context, cpplib obtains the next token from the list
of the top context. If there are no tokens left in the list, it pops
that context off the stack, and subsequent ones if necessary, until an
unexhausted context is found or it returns to base context. In base
context, cpplib reads tokens directly from the lexer.

If it encounters an identifier that is both a macro and enabled for
expansion, cpplib prepares to push a new context for that macro on the
stack by calling the routine @code{enter_macro_context}. When this
routine returns, the new context will contain the unexpanded tokens of
the replacement list of that macro. In the case of function-like
macros, @code{enter_macro_context} also replaces any parameters in the
replacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
appropriate macro argument. If the standard requires that the
parameter be replaced with its expanded argument, the argument will
have been fully macro expanded first.

@code{enter_macro_context} also handles special macros like
@code{__LINE__}.
Although these macros expand to a single token which
cannot contain any further macros, for reasons of token spacing
(@pxref{Token Spacing}) and simplicity of implementation, cpplib
handles these special macros by pushing a context containing just that
one token.

The final thing that @code{enter_macro_context} does before returning
is to mark the macro disabled for expansion (except for special macros
like @code{__TIME__}). The macro is re-enabled when its context is
later popped from the context stack, as described above. This strict
ordering ensures that a macro is disabled whilst its expansion is
being scanned, but that it is @emph{not} disabled whilst any arguments
to it are being expanded.

@section Scanning the replacement list for macros to expand
The C standard states that, after any parameters have been replaced
with their possibly-expanded arguments, the replacement list is
scanned for nested macros. Further, any identifiers in the
replacement list that are not expanded during this scan are never
again eligible for expansion in the future, if the reason they were
not expanded is that the macro in question was disabled.

Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion. Other tokens never have an opportunity to be
re-tested for expansion.
It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan. This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token
happens to be an opening parenthesis (itself possibly the first token
of an argument).

It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack. Only when
looking for the @emph{next} token do we pop it off the stack and drop
to a lower context. This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current
context is still disabled when we are considering the last token of
its replacement list for expansion (or indeed expanding it). As an
example, which illustrates many of the points above, consider

@smallexample
#define foo(x) bar x
foo(foo) (2)
@end smallexample

@noindent which fully expands to @samp{bar foo (2)}.
During pre-expansion
of the argument, @samp{foo} does not expand even though the macro is
enabled, since it has no following parenthesis [pre-expansion of an
argument only uses tokens from that argument; it cannot take tokens
from whatever follows the macro invocation]. This still leaves the
argument token @samp{foo} eligible for future expansion. Then, when
re-scanning after argument replacement, the token @samp{foo} is
rejected for expansion, and marked ineligible for future expansion,
since the macro is now disabled. It is disabled because the
replacement list @samp{bar foo} of the macro is still on the context
stack.

If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro was disabled it would be subtly wrong.
In the example above, the replacement list of @samp{foo} would be
popped in the process of finding the parenthesis, re-enabling
@samp{foo} and expanding it a second time.

@section Looking for a function-like macro's opening parenthesis
Function-like macros only expand when immediately followed by a
parenthesis. To do this cpplib needs to temporarily disable macros
and read the next token.
Unfortunately, because of spacing issues
(@pxref{Token Spacing}), there can be fake padding tokens in between,
and if the next real token is not a parenthesis cpplib needs to be
able to back up that one token as well as retain the information in
any intervening padding tokens.

Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which are too hard.
Instead, searching for the parenthesis is handled by a special
function, @code{funlike_invocation_p}, which remembers padding
information as it reads tokens. If the next real token is not an
opening parenthesis, it backs up that one token, and then pushes an
extra context just containing the padding information if necessary.

@section Marking tokens ineligible for future expansion
As discussed above, cpplib needs a way of marking tokens as
unexpandable. Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the
flag @code{NO_EXPAND} to the copy.

For efficiency and to simplify memory management by avoiding having to
remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (@pxref{Lexing a line}) using the
function @code{_cpp_temp_token}.
The tokens are then re-used once the
current line of tokens has been read in.

This might sound unsafe. However, token runs are not re-used at the
end of a line if that line ends in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization
is safe.

@node Token Spacing
@unnumbered Token Spacing
@cindex paste avoidance
@cindex spacing
@cindex token spacing

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its preprocessed
output results in an identical token stream. Without taking special
measures, this might not be the case because of macro substitution.
For example:

@smallexample
#define PLUS +
#define EMPTY
#define f(x) =x=
+PLUS -EMPTY- PLUS+ f(=)
        @expansion{} + + - - + + = = =
@emph{not}
        @expansion{} ++ -- ++ ===
@end smallexample

One solution would be to simply insert a space between all adjacent
tokens.
However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

For now, just notice that when tokens are added (or removed, as shown by
the @code{EMPTY} example) from the original lexed token stream, we need
to check for accidental token pasting. We call this @dfn{paste
avoidance}. Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the @samp{#} and @samp{##} operators.

Look at how the preprocessor gets whitespace output correct
normally. The @code{cpp_token} structure contains a flags byte, and one
of those flags is @code{PREV_WHITE}. This is flagged by the lexer, and
indicates that the token was preceded by whitespace of some form other
than a new line. The stand-alone preprocessor can use this flag to
decide whether to insert a space between tokens in the output.
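
A minimal sketch of how an output routine might consult this flag
follows. Only the flag name @code{PREV_WHITE} is taken from cpplib; the
numeric value given to it here, @code{struct tok} and
@code{spell_tokens} are assumptions made for illustration, and the real
output code does considerably more (padding tokens and paste avoidance,
discussed below, are ignored).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PREV_WHITE 0x01  /* assumed value; set when whitespace preceded the token */

/* Hypothetical cut-down token: a spelling plus a flags byte. */
struct tok {
    const char *spelling;
    unsigned char flags;
};

/* Write the spellings of the N tokens in TOKS into OUT, which has room
   for CAP bytes, putting one space before each token that carries
   PREV_WHITE.  Returns the length written, excluding the terminating
   NUL, or (size_t) -1 if OUT is too small. */
size_t spell_tokens(const struct tok *toks, size_t n, char *out, size_t cap)
{
    size_t used = 0;

    assert(out != NULL && (toks != NULL || n == 0));
    if (cap == 0)
        return (size_t) -1;

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(toks[i].spelling);
        size_t space = (toks[i].flags & PREV_WHITE) ? 1 : 0;

        if (used + space + len + 1 > cap)
            return (size_t) -1;
        if (space)
            out[used++] = ' ';
        memcpy(out + used, toks[i].spelling, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

Given the four tokens @samp{sum}, @samp{=}, @samp{1} and @samp{;}, with
@code{PREV_WHITE} set on the second and third only, this routine
produces @samp{sum = 1;}.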

Now consider the result of the following macro expansion:

@smallexample
#define add(x, y, z) x + y +z;
sum = add (1,2, 3);
        @expansion{} sum = 1 + 2 +3;
@end smallexample

The interesting thing here is that the tokens @samp{1} and @samp{2} are
output with a preceding space, and @samp{3} is output without a
preceding space, but when lexed none of these tokens had that property.
Careful consideration reveals that @samp{1} gets its preceding
whitespace from the space preceding @samp{add} in the macro invocation,
@emph{not} from the replacement list. @samp{2} gets its whitespace from
the space preceding the parameter @samp{y} in the macro replacement list,
and @samp{3} has no preceding space because parameter @samp{z} has none
in the replacement list.

Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions. So instead of modifying the two tokens
above, the preprocessor inserts a special token, which I call a
@dfn{padding token}, into the token stream to indicate that spacing of
the subsequent token is special. The preprocessor inserts padding
tokens in front of every macro expansion and expanded macro argument.
These point to a @dfn{source token} from which the subsequent real token
should inherit its spacing.
In the above example, the source tokens are
@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
macro replacement list, respectively.

It is quite easy to get multiple padding tokens in a row, for example if
a macro's first replacement token expands straight into another macro.

@smallexample
#define foo bar
#define bar baz
[foo]
        @expansion{} [baz]
@end smallexample

Here, two padding tokens are generated, whose sources are the @samp{foo}
token between the brackets and the @samp{bar} token from foo's
replacement list, respectively. Clearly the first padding token is the
one to use, so the output code should contain a rule that the first
padding token in a sequence is the one that matters.

But what if a macro expansion is left? Adjusting the above
example slightly:

@smallexample
#define foo bar
#define bar EMPTY baz
#define EMPTY
[foo] EMPTY;
        @expansion{} [ baz] ;
@end smallexample

As shown, now there should be a space before @samp{baz} and the
semicolon in the output.

The rules we decided above fail for @samp{baz}: we generate three
padding tokens, one per macro invocation, before the token @samp{baz}.
We would then have it take its spacing from the first of these, which
carries source token @samp{foo} with no leading space.

It is vital that cpplib get spacing correct in these examples since any
of these macro expansions could be stringified, where spacing matters.

So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too. I made
cpplib insert a padding token with a @code{NULL} source token when
leaving macro expansions, as well as after each replaced argument in a
macro's replacement list. It also inserts appropriate padding tokens on
either side of tokens created by the @samp{#} and @samp{##} operators.
I expanded the rule so that, if we see a padding token with a
@code{NULL} source token, @emph{and} the source token recorded so far
has no leading space, then we behave as if we have seen no padding
tokens at all. A quick check shows this rule will then get the above
example correct as well.

Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct.
This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function @code{cpp_avoid_paste} advises whether a space is required
between two consecutive tokens. To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.

@node Line Numbering
@unnumbered Line numbering
@cindex line numbers

@section Just which line number anyway?

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

@itemize @bullet
@item
The source line it was lexed on.
@item
The line it is output on. This can be different to the line it was
lexed on if, for example, there are intervening escaped newlines or
C-style comments.
For example:

@smallexample
foo /* @r{A long
comment} */ bar \
baz
@result{}
foo bar baz
@end smallexample

@item
If the token results from a macro expansion, the line of the macro name,
or possibly the line of the closing parenthesis in the case of
function-like macro expansion.
@end itemize

The @code{cpp_token} structure contains @code{line} and @code{col}
members. The lexer fills these in with the line and column of the first
character of the token. Consequently, but maybe unexpectedly, a token
from the replacement list of a macro expansion carries the location of
the token within the @code{#define} directive, because cpplib expands a
macro by returning pointers to the tokens in its replacement list. The
current implementation of cpplib assigns tokens created from built-in
macros and the @samp{#} and @samp{##} operators the location of the most
recently lexed token. This is because they are allocated from the
lexer's token runs, and because of the way the diagnostic routines infer
the appropriate location to report.

The diagnostic routines in cpplib display the location of the most
recently @emph{lexed} token, unless they are passed a specific line and
column to report.
For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.

The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion. All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

To solve these issues, cpplib provides a callback that is generated
whenever it lexes a preprocessing token that starts a new logical line
other than a directive. It passes this token (which may be a
@code{CPP_EOF} token indicating the end of the translation unit) to the
callback routine, which can then use the line and column of this token
to produce correct output.

@section Representation of line numbers

As mentioned above, cpplib stores with each token the line number that
it was lexed on. In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

The preprocessor maintains a monotonically increasing line count, which
is incremented at every new line character (and also at the end of any
buffer that does not end in a new line). Since a line number of zero is
useful to indicate certain special states and conditions, this variable
starts counting from one.

This variable therefore uniquely enumerates each line in the translation
unit. With some simple infrastructure, it is straightforward to map
from this to the original source file and line number pair, saving space
whenever line number information needs to be saved. The code that
implements this mapping lies in the files @file{line-map.c} and
@file{line-map.h}.

Command-line macros and assertions are implemented by pushing a buffer
containing the right hand side of an equivalent @code{#define} or
@code{#assert} directive. Some built-in macros are handled similarly.
Since these are all processed before the first line of the main input
file, it will typically have an assigned line closer to twenty than to
one.
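
To give a feel for the kind of infrastructure involved, here is a
sketch of one way to map a translation-unit line back to a source file
and line. It is an illustration under stated assumptions, not the code
in @file{line-map.c}: @code{struct map_entry}, @code{struct source_pos}
and @code{lookup_line} are hypothetical names, and the table is assumed
to gain one entry each time a file is entered or resumed, kept sorted
by the translation-unit line at which the entry takes effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical map entry: from translation-unit line FROM_LINE onwards
   (until the next entry takes over), lines come from FILE, starting at
   source line TO_LINE. */
struct map_entry {
    unsigned from_line;
    const char *file;
    unsigned to_line;
};

struct source_pos {
    const char *file;
    unsigned line;
};

/* Map translation-unit line LINE to a file and line pair.  MAPS holds
   N entries sorted by ascending from_line, and LINE must not precede
   the first entry. */
struct source_pos lookup_line(const struct map_entry *maps, size_t n,
                              unsigned line)
{
    size_t lo = 0, hi = n;
    struct source_pos pos;

    assert(n > 0 && line >= maps[0].from_line);

    /* Binary search for the last entry whose from_line is <= LINE.
       Invariant: maps[lo].from_line <= LINE, and either hi == n or
       maps[hi].from_line > LINE. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (maps[mid].from_line <= line)
            lo = mid;
        else
            hi = mid;
    }

    pos.file = maps[lo].file;
    pos.line = maps[lo].to_line + (line - maps[lo].from_line);
    return pos;
}
```

For instance, if @file{main.c} includes a three-line header
@file{a.h} on its second line, the table might hold the entries
(1, @file{main.c}, 1), (3, @file{a.h}, 1) and (6, @file{main.c}, 3);
translation-unit line 4 then maps to line 2 of @file{a.h}. Storing a
single integer per token and consulting the table only when needed is
what saves the space.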

@node Guard Macros
@unnumbered The Multiple-Include Optimization
@cindex guard macros
@cindex controlling macros
@cindex multiple-include optimization

Header files are often of the form

@smallexample
#ifndef FOO
#define FOO
@dots{}
#endif
@end smallexample

@noindent
to prevent the compiler from processing them more than once. The
preprocessor notices such header files, so that if the header file
appears in a subsequent @code{#include} directive and @code{FOO} is
defined, then the directive is ignored and the preprocessor doesn't
preprocess or even re-open the file a second time. This is referred to
as the @dfn{multiple include optimization}.

Under what circumstances is such an optimization valid? If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process. Therefore the current implementation imposes
requirements and makes some allowances as follows:

@enumerate
@item
There must be no tokens outside the controlling @code{#if}-@code{#endif}
pair, but whitespace and comments are permitted.

@item
There must be no directives outside the controlling directive pair, but
the @dfn{null directive} (a line containing nothing other than a single
@samp{#} and possibly whitespace) is permitted.

@item
The opening directive must be of the form

@smallexample
#ifndef FOO
@end smallexample

or

@smallexample
#if !defined FOO     [equivalently, #if !defined(FOO)]
@end smallexample

@item
In the second form above, the tokens forming the @code{#if} expression
must have come directly from the source file---no macro expansion must
have been involved. This is because macro definitions can change, and
tracking whether or not a relevant change has been made is not worth the
implementation cost.

@item
There can be no @code{#else} or @code{#elif} directives at the outer
conditional block level, because they would probably contain something
of interest to a subsequent pass.
@end enumerate

First, when pushing a new file on the buffer stack,
@code{stack_include_file} sets the controlling macro @code{mi_cmacro} to
@code{NULL}, and sets @code{mi_valid} to @code{true}.
This indicates
that the preprocessor has not yet encountered anything that would
invalidate the multiple-include optimization. As described in the next
few paragraphs, these two variables having these values effectively
indicates top-of-file.

When about to return a token that is not part of a directive,
@code{_cpp_lex_token} sets @code{mi_valid} to @code{false}. This
enforces the constraint that tokens outside the controlling conditional
block invalidate the optimization.

The @code{do_if}, when appropriate, and @code{do_ifndef} directive
handlers pass the controlling macro to the function
@code{push_conditional}. cpplib maintains a stack of nested conditional
blocks, and after processing every opening conditional this function
pushes an @code{if_stack} structure onto the stack. In this structure
it records the controlling macro for the block, provided there is one
and we're at top-of-file (as described above). If an @code{#elif} or
@code{#else} directive is encountered, the controlling macro for that
block is cleared to @code{NULL}. Otherwise, it survives until the
@code{#endif} closing the block, upon which @code{do_endif} sets
@code{mi_valid} to @code{true} and stores the controlling macro in
@code{mi_cmacro}.

@code{_cpp_handle_directive} clears @code{mi_valid} when processing any
directive other than an opening conditional or the null directive.
Together with the requirements that a controlling macro is only
recorded at top-of-file, and that it only survives to be copied to
@code{mi_cmacro} by @code{do_endif} if no @code{#else} or @code{#elif}
intervenes, this enforces the absence of directives outside the main
conditional block whenever the optimization is on.

Note that whilst we are inside the conditional block, @code{mi_valid} is
likely to be reset to @code{false}, but this does not matter since
the closing @code{#endif} restores it to @code{true} if appropriate.

Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack
at @code{EOF} without returning a token, if the @code{#endif} directive
was not followed by any tokens, @code{mi_valid} is @code{true} and
@code{_cpp_pop_file_buffer} remembers the controlling macro associated
with the file. Subsequent calls to @code{stack_include_file} result in
no buffer being pushed if the controlling macro is defined, effecting
the optimization.

A quick word on how we handle the

@smallexample
#if !defined FOO
@end smallexample

@noindent
case. @code{_cpp_parse_expr} and @code{parse_defined} take steps to see
whether the three stages @samp{!}, @samp{defined-expression} and
@samp{end-of-directive} occur in order in a @code{#if} expression.
If so, they return the guard macro to @code{do_if} in the variable
@code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.
@code{enter_macro_context} sets @code{mi_valid} to @code{false}, so if
a macro was expanded whilst parsing any part of the expression, then the
top-of-file test in @code{push_conditional} fails and the optimization
is turned off.

@node Files
@unnumbered File Handling
@cindex files

Fairly obviously, the file handling code of cpplib resides in the file
@file{files.c}. It takes care of the details of file searching,
opening, reading and caching, for both the main source file and all the
headers it recursively includes.

The basic strategy is to minimize the number of system calls. On many
systems, the basic @code{open ()} and @code{fstat ()} system calls can
be quite expensive. For every @code{#include}-d file, we need to try
all the directories in the search path until we find a match. Some
projects, such as glibc, pass twenty or thirty include paths on the
command line, so this can rapidly become time-consuming.

For a header file we have not encountered before we have little choice
but to do this. However, it is often the case that the same headers are
repeatedly included, and in these cases we try to avoid repeating the
filesystem queries whilst searching for the correct file.
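
As an illustration only, the following sketch shows the kind of
uncached lookup described above: walk the search path, build a candidate
name in each directory, and stop at the first one that opens. The names
@code{struct search_dir} and @code{find_in_path} are hypothetical and
are not cpplib identifiers, and the sketch assumes a POSIX host.

@smallexample
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical search-path node; not a cpplib type.  */
struct search_dir
@{
  const char *name;
  struct search_dir *next;
@};

/* Try DIR/FNAME for each directory in CHAIN, in order.  Return an
   open descriptor for the first match, or -1 if there is none.
   Every directory that lacks the file costs one failed open ().  */
static int
find_in_path (const struct search_dir *chain, const char *fname)
@{
  char path[4096];
  const struct search_dir *d;

  for (d = chain; d != NULL; d = d->next)
    @{
      int fd;
      int n = snprintf (path, sizeof path, "%s/%s", d->name, fname);

      if (n < 0 || (size_t) n >= sizeof path)
        continue;  /* Candidate name too long; skip it.  */
      fd = open (path, O_RDONLY);
      if (fd >= 0)
        return fd;
    @}
  return -1;
@}
@end smallexample

@noindent
With twenty or thirty directories on the chain, a header that lives in
the last one costs that many failed @code{open ()} calls on every
lookup, which is the cost that remembering earlier results avoids.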

For each file we try to open, we store the constructed path in a splay
tree. This path first undergoes simplification by the function
@code{_cpp_simplify_pathname}. For example,
@file{/usr/include/bits/../foo.h} is simplified to
@file{/usr/include/foo.h} before we enter it in the splay tree and try
to @code{open ()} the file. CPP will then find subsequent uses of
@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
save system calls.

Further, it is likely the file contents have also been cached, saving a
@code{read ()} system call. We don't bother caching the contents of
header files that are re-inclusion protected, and whose re-inclusion
macro is defined when we leave the header file for the first time. If
the host supports it, we try to map suitably large files into memory,
rather than reading them in directly.

The include paths are internally stored on a null-terminated
singly-linked list, starting with the @code{"header.h"} directory search
chain, which then links into the @code{<header.h>} directory chain.

Files included with the @code{<foo.h>} syntax start the lookup directly
in the second half of this chain. However, files included with the
@code{"foo.h"} syntax start at the beginning of the chain, but with one
extra directory prepended.
This is the directory of the current file,
that is, the one containing the @code{#include} directive. Prepending
this directory on a per-file basis is handled by the function
@code{search_from}.

Note that a header included with a directory component, such as
@code{#include "mydir/foo.h"} and opened as
@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
the basename @samp{foo.h} as the current directory.

Enough information is stored in the splay tree that CPP can immediately
tell whether it can skip the header file because of the multiple-include
optimization, whether the file didn't exist or couldn't be opened for
some reason, or whether the header was flagged not to be re-used, as it
is with the obsolete @code{#import} directive.

For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
CPP offers the ability to treat various include file names as aliases
for the real header files with shorter names. The map from one to the
other is found in a special file called @samp{header.gcc}, stored in the
command line (or system) include directories to which the mapping
applies. This may be higher up the directory tree than the full path to
the file minus the base name.

@node Concept Index
@unnumbered Concept Index
@printindex cp

@bye