\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo
@dircategory Software development
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright (C) 2000-2020 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@title Cpplib Internals
@versionsubtitle
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000-2020 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@contents
@page

@ifnottex
@node Top
@top
@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is
implemented as a library, @dfn{cpplib}, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends.  It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some
of the tricky issues.  It is intended that, along with the comments in
the source code, a reasonably competent C programmer should be able to
figure out what the code is doing, and why things have been implemented
the way they have.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.
@end menu
@end ifnottex

@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces---one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @file{internal.h}.  Functions and types exposed to external
clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.  For
historical reasons this is no longer quite true, but we should strive to
stick to it.

We are striving to reduce the information exposed in @file{cpplib.h} to the
bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.

@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines

@section Overview
The lexer is contained in the file @file{lex.c}.  It is a hand-coded
lexer, and not implemented as a state machine.  It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.

@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines.  In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor r@^ole to play in memory
management of lexed lines.  I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @code{cur_token}, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@code{line} and @code{col} values of the token just before the location
that @code{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags.  Each token has its
@code{line} and @code{col} variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set.  The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.

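To make the flag and position bookkeeping above concrete, here is a
minimal sketch of a toy lexer that records @code{line}, @code{col},
and flags modelled on @code{PREV_WHITE} and @code{BOL}.  It is not
cpplib's code: every @code{toy_} name is hypothetical, tokens are just
blank-separated words, and a tab is counted as a single column.

```c
#include <stddef.h>

/* Hypothetical flag bits modelled on PREV_WHITE and BOL. */
enum { TOY_PREV_WHITE = 1 << 0, TOY_BOL = 1 << 1 };

struct toy_token {
    const char *start;    /* first character of the token */
    size_t len;           /* length in characters */
    unsigned line, col;   /* 1-based position of the first character */
    unsigned flags;       /* TOY_PREV_WHITE and/or TOY_BOL */
};

static int toy_is_blank(char c) { return c == ' ' || c == '\t'; }

/* Split TEXT into blank-separated words, storing at most MAX tokens in OUT.
   Whitespace never becomes a token; it only sets a flag on the next one.
   Returns the number of tokens stored. */
size_t toy_lex(const char *text, struct toy_token *out, size_t max)
{
    size_t n = 0;
    unsigned line = 1, col = 1;
    int at_bol = 1, saw_white = 0;
    const char *p = text;

    while (*p != '\0' && n < max) {
        if (*p == '\n') {
            /* A new line resets the column and arms the BOL flag. */
            line++; col = 1; at_bol = 1; saw_white = 0; p++;
        } else if (toy_is_blank(*p)) {
            saw_white = 1; col++; p++;
        } else {
            struct toy_token *t = &out[n++];
            t->start = p;
            t->line = line;
            t->col = col;
            t->flags = 0;
            if (saw_white) t->flags |= TOY_PREV_WHITE;
            if (at_bol)    t->flags |= TOY_BOL;
            while (*p != '\0' && *p != '\n' && !toy_is_blank(*p)) { p++; col++; }
            t->len = (size_t)(p - t->start);
            at_bol = 0; saw_white = 0;
        }
    }
    return n;
}
```

Lexing the two lines @samp{  #define X} and @samp{foo bar} with this
sketch gives four tokens; the first carries both flags, and
@samp{foo} carries only the beginning-of-line flag.
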
New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive.  In a directive a @code{CPP_EOF} token never means
end-of-file.  Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringized.  The state variable
@code{parsing_args} is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like

@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

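The context-dependent treatment of new lines described above can be
summarized as a small decision function.  This is only a sketch under
the assumption that the two state variables are all that matters; the
@code{toy_} types and names are hypothetical, and the real lexer does
considerably more work at each new line.

```c
/* Hypothetical lexer state, using the variable names from the text. */
struct toy_state {
    int in_directive;   /* nonzero while lexing a directive line */
    int parsing_args;   /* 0, 1 (seeking the '('), or 2 (collecting arguments) */
};

enum toy_newline_action {
    TOY_NL_PLAIN,          /* ordinary line break, no flag on the next token */
    TOY_NL_END_DIRECTIVE,  /* hand back an end-of-directive (CPP_EOF-like) token */
    TOY_NL_AS_WHITE        /* treat as whitespace: flag the next token PREV_WHITE */
};

/* Decide what an unescaped new line means in the current state. */
enum toy_newline_action toy_on_newline(const struct toy_state *s)
{
    if (s->in_directive)
        return TOY_NL_END_DIRECTIVE;  /* ends the directive even mid macro call */
    if (s->parsing_args == 2)
        return TOY_NL_AS_WHITE;       /* matters if the argument is stringized */
    return TOY_NL_PLAIN;              /* includes parsing_args == 1 */
}
```

Returning the plain action when @code{parsing_args} is 1 corresponds to
the @samp{foo}/@samp{baz} case shown above, where no space must be
introduced.
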
The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator.  This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.

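A minimal sketch of that rule follows.  The function names are
hypothetical and this is not the code of @code{handle_newline}; it only
shows how a two-character indicator can be recognized by checking that
the second character is the @emph{other} newline character.

```c
#include <stddef.h>

/* Return the length (1 or 2) of the new line indicator at S, or 0 if S does
   not start with one.  "\r\n" and "\n\r" each count as one new line, while
   "\n\n" and "\r\r" are two separate new lines. */
size_t toy_newline_len(const char *s)
{
    if (s[0] != '\n' && s[0] != '\r')
        return 0;
    /* A different newline character right after the first pairs with it. */
    if ((s[1] == '\n' || s[1] == '\r') && s[1] != s[0])
        return 2;
    return 1;
}

/* Count the new lines in TEXT using the rule above. */
size_t toy_count_newlines(const char *text)
{
    size_t count = 0;

    while (*text != '\0') {
        size_t n = toy_newline_len(text);
        if (n > 0) {
            count++;
            text += n;
        } else {
            text++;
        }
    }
    return count;
}
```
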
We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars.  Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

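As an illustration only, the check might look like the sketch below.
The function name and the @code{warn} out-parameter are assumptions
made for this example; it treats only spaces and tabs as whitespace and
handles the plain backslash but not the trigraph spelling.

```c
#include <stddef.h>

/* If S starts with a backslash that escapes a new line, return the number
   of characters up to and including the new line indicator; otherwise
   return 0.  *WARN is set to 1 when blanks separate the backslash from the
   new line, the non-conforming case that deserves a warning. */
size_t toy_escaped_newline(const char *s, int *warn)
{
    size_t i = 1;

    *warn = 0;
    if (s[0] != '\\')
        return 0;
    while (s[i] == ' ' || s[i] == '\t')
        i++;
    if (s[i] != '\n' && s[i] != '\r')
        return 0;
    *warn = (i > 1);
    /* Swallow the second half of a two-character new line indicator. */
    if ((s[i + 1] == '\n' || s[i + 1] == '\r') && s[i + 1] != s[i])
        i++;
    return i + 1;
}
```
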
Handling newlines like this is made simpler by doing it in one place
only.  The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminate a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline.  If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly, code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.

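The idea behind @code{get_effective_char} can be sketched as follows.
This is a simplified stand-in rather than the function from
@file{lex.c}: the @code{toy_} names are hypothetical, only @samp{\n} is
accepted as the new line, and no whitespace is allowed between the
backslash and the new line.

```c
#include <stddef.h>

/* Length of an escaped new line at S, or 0 if there is none.  The backslash
   may be spelled directly or as its trigraph (two question marks followed
   by a slash). */
static size_t toy_escaped_nl_len(const char *s)
{
    size_t i;

    if (s[0] == '\\')
        i = 1;
    else if (s[0] == '?' && s[1] == '?' && s[2] == '/')
        i = 3;
    else
        return 0;
    return s[i] == '\n' ? i + 1 : 0;
}

/* Skip any run of escaped new lines at *P, then return the next character
   and advance *P past it.  At the end of the buffer, return '\0' without
   advancing. */
int toy_effective_char(const char **p)
{
    size_t n;

    while ((n = toy_escaped_nl_len(*p)) > 0)
        *p += n;
    if (**p == '\0')
        return '\0';
    return (unsigned char) *(*p)++;
}

/* Return 1 if the text at S spells "+=", even when escaped new lines
   separate the two characters. */
int toy_is_plus_eq(const char *s)
{
    if (toy_effective_char(&s) != '+')
        return 0;
    return toy_effective_char(&s) == '=';
}
```

Because the skipping loop repeats until no escaped new line is left,
an arbitrarily long sequence between the @samp{+} and the @samp{=} is
handled the same way as a single one.
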
The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

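A sketch of tab-aware column counting is shown below.  It assumes
1-based columns and tab stops every @code{tabstop} columns (with
@code{tabstop} at least 1); that is one common convention and may not
match cpplib's arithmetic exactly.  Both function names are
hypothetical.

```c
/* Return the column reached after consuming character C at 1-based column
   COL, with a tab stop every TABSTOP columns (TABSTOP must be >= 1). */
unsigned toy_advance_col(unsigned col, char c, unsigned tabstop)
{
    if (c == '\t')
        /* Jump to the first column after the next tab stop boundary. */
        return ((col - 1) / tabstop + 1) * tabstop + 1;
    return col + 1;
}

/* Return the 1-based column of the character at index INDEX in LINE.
   Every character counts, including those inside a comment. */
unsigned toy_column_of(const char *line, unsigned index, unsigned tabstop)
{
    unsigned col = 1, i;

    for (i = 0; i < index && line[i] != '\0'; i++)
        col = toy_advance_col(col, line[i], tabstop);
    return col;
}
```
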
@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a @samp{<} would be lexed as a token in
its own right.  After a @code{#include} directive, though, everything
from the @samp{<} up to the nearest @samp{>} character should be lexed
as a single header-name token.  Note that we don't allow the terminators
of header names to be escaped; the first @samp{"} or @samp{>} terminates
the header name.

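The header-name rule can be sketched like this.  The function is a
hypothetical illustration, not the lexer's own routine; it assumes the
caller has already decided, from its state flags, that a header name is
expected at this point.

```c
#include <stddef.h>

/* If S starts with '<' or '"', return the length of the header-name token
   including both delimiters, or 0 if no terminator appears before the end
   of the line.  No escapes are recognized: the first terminator wins. */
size_t toy_header_name_len(const char *s)
{
    char term;
    size_t i;

    if (s[0] == '<')
        term = '>';
    else if (s[0] == '"')
        term = '"';
    else
        return 0;

    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == term)
            return i + 1;
    return 0;
}
```

Note how a backslash before a quote does not protect it: in the input
@samp{"a\"b.h"} the name ends at the second quote character.
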
Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error.  Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.

Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls @code{cpp_destroy}, probably at the end of the
translation unit.

@anchor{Lexing a line}
@section Lexing a line
@cindex token run

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run.  Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.

The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
for stringizing and token pasting.  I handled this by allocating
space for these tokens from the lexer's token run chain.  This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management
issues, but not all.  The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line.  So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero.  Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro.  In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.

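The token run scheme described in this section can be sketched with a
chain of fixed-size blocks.  The structures below are a simplified
illustration under assumed names (everything prefixed @code{toy_} is
hypothetical, and the fields are not those of cpplib's own
structures); freeing the chain is omitted for brevity.

```c
#include <stdlib.h>

struct toy_token { int type; };

/* A fixed-size block of tokens.  Blocks are chained, never reallocated, so
   pointers into earlier blocks stay valid. */
struct toy_tokenrun {
    struct toy_tokenrun *next;         /* following run in the chain, or NULL */
    struct toy_token *base, *limit;    /* storage is [base, limit) */
};

struct toy_lexer {
    struct toy_tokenrun *first_run;    /* head of the chain */
    struct toy_tokenrun *cur_run;      /* run currently being filled */
    struct toy_token *cur_token;       /* next free slot */
    size_t run_size;                   /* tokens per run */
    unsigned keep_tokens;              /* nonzero: do not rewind at end of line */
};

static struct toy_tokenrun *toy_new_run(size_t count)
{
    struct toy_tokenrun *run = malloc(sizeof *run);

    if (run == NULL)
        return NULL;
    run->base = malloc(count * sizeof *run->base);
    if (run->base == NULL) {
        free(run);
        return NULL;
    }
    run->limit = run->base + count;
    run->next = NULL;
    return run;
}

/* Returns 0 on success, -1 if allocation fails.  RUN_SIZE must be nonzero. */
int toy_lexer_init(struct toy_lexer *lx, size_t run_size)
{
    lx->first_run = toy_new_run(run_size);
    if (lx->first_run == NULL)
        return -1;
    lx->cur_run = lx->first_run;
    lx->cur_token = lx->first_run->base;
    lx->run_size = run_size;
    lx->keep_tokens = 0;
    return 0;
}

/* Return storage for the next token, moving to (or chaining on) another run
   when the current one is full.  Returns NULL if allocation fails. */
struct toy_token *toy_next_slot(struct toy_lexer *lx)
{
    if (lx->cur_token == lx->cur_run->limit) {
        if (lx->cur_run->next == NULL) {
            lx->cur_run->next = toy_new_run(lx->run_size);
            if (lx->cur_run->next == NULL)
                return NULL;
        }
        lx->cur_run = lx->cur_run->next;
        lx->cur_token = lx->cur_run->base;
    }
    return lx->cur_token++;
}

/* At an unescaped new line, reuse storage from the start of the chain unless
   somebody has asked for tokens to be kept. */
void toy_end_of_line(struct toy_lexer *lx)
{
    if (lx->keep_tokens == 0) {
        lx->cur_run = lx->first_run;
        lx->cur_token = lx->first_run->base;
    }
}
```

With a run size of 2, the third token of a line lands in a second run
while pointers to the first two remain usable; after
@code{toy_end_of_line} the next token reuses the first slot, unless
@code{keep_tokens} has been raised.
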
The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.

@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an ``identifier'', it generates a hash code for
it and stores it in the hash table.  By ``identifier'' we mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on.  For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.

Each node in the hash table contains various information about the
identifier it represents, for example its length and type.  At any one
time, each identifier falls into exactly one of three categories:

@itemize @bullet
@item Macros

These have been declared to be macros, either on the command line or
with @code{#define}.  A few, such as @code{__TIME__}, are built-ins
entered in the hash table during initialization.  The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion.  The hash node for a built-in macro is flagged as
special, and instead contains an enum indicating which of the various
built-in macros it is.

@item Assertions

Assertions are in a separate namespace to macros.  To enforce this, cpp
actually prepends a @code{#} character to the assertion's name before
hashing it and entering it in the hash table.  An assertion's node
points to a chain of answers to that assertion.

@item Void

Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.

When preprocessing C++, this category also includes the named operators,
such as @code{xor}.  In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently.  This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}.  Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize

The same identifiers share the same hash node.  Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information.  For example, when
parsing a @code{#define} statement, CPP flags each argument's identifier
hash node with the index of that argument.  This makes duplicated
argument checking an O(1) operation for each argument.  Similarly, for
each identifier in the macro's expansion, lookup to see if it is an
argument, and which argument it is, is also an O(1) operation.  Further,
each directive name, such as @code{endif}, has an associated directive
enum stored in its hash node, so that directive lookup is also O(1).

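The duplicate-parameter check described above can be illustrated with
a tiny interning table.  This is a sketch under stated assumptions, not
cpplib's hash table: the fixed capacity, the hash function, the
@code{arg_index} field and all @code{toy_} names are hypothetical, the
table stores the caller's string pointers without copying them, and it
must start out zero-initialized.

```c
#include <stddef.h>
#include <string.h>

#define TOY_TABLE_SIZE 64   /* power of two; hypothetical fixed capacity */

struct toy_hashnode {
    const char *name;   /* spelling; NULL marks an empty slot */
    int arg_index;      /* 1-based parameter index, 0 if not a parameter */
};

struct toy_table { struct toy_hashnode slots[TOY_TABLE_SIZE]; };

static unsigned toy_hash(const char *s)
{
    unsigned h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char) *s++;
    return h;
}

/* Return the single shared node for NAME, creating it on first sight.
   Returns NULL only if the table is full. */
struct toy_hashnode *toy_lookup(struct toy_table *t, const char *name)
{
    unsigned i = toy_hash(name) & (TOY_TABLE_SIZE - 1);
    unsigned probes;

    for (probes = 0; probes < TOY_TABLE_SIZE; probes++) {
        struct toy_hashnode *n = &t->slots[i];
        if (n->name == NULL) {
            n->name = name;
            n->arg_index = 0;
            return n;
        }
        if (strcmp(n->name, name) == 0)
            return n;
        i = (i + 1) & (TOY_TABLE_SIZE - 1);   /* linear probing */
    }
    return NULL;
}

/* Flag each parameter's node with its index.  Returns 0 on success, the
   1-based position of the first duplicated parameter, or -1 if the table
   is full.  Each check is a single field test on the shared node. */
int toy_save_params(struct toy_table *t, const char *const *params, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        struct toy_hashnode *n = toy_lookup(t, params[i]);
        if (n == NULL)
            return -1;
        if (n->arg_index != 0)
            return i + 1;          /* already flagged: duplicate */
        n->arg_index = i + 1;
    }
    return 0;
}
```

A fuller version would clear the @code{arg_index} flags again once the
@code{#define} has been parsed, so that the same names can serve as
parameters of later macros.
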
@node Macro Expansion
@unnumbered Macro Expansion Algorithm
@cindex macro expansion

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to
optimize the preprocessor's expansion algorithm wrong in quite subtle
ways.

I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this
section, let alone the code!  If you don't have a clear mental
picture of how things like nested macro expansion, stringizing and
token pasting are supposed to work, damage to your sanity can quickly
result.

@section Internal representation of macros
@cindex macro representation (internal)

The preprocessor stores macro expansions in tokenized form.  This
saves repeated lexing passes during expansion, at the cost of a small
increase in memory consumption on average.  The tokens are stored
contiguously in memory, so a pointer to the first one and a token
count is all you need to get the replacement list of a macro.

If the macro is a function-like macro the preprocessor also stores its
parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier.  Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type @code{CPP_MACRO_ARG}.  Each such token holds the
index of the parameter it represents in the parameter list, which
allows rapid replacement of parameters with their arguments during
expansion.  Despite this optimization it is still necessary to store
the original parameters to the macro, both for dumping with, e.g.,
@option{-dD}, and to warn about non-trivial macro redefinitions when
the parameter names have changed.

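A rough sketch of this representation, and of why the stored index
makes substitution cheap, follows.  The structures are hypothetical
simplifications (all @code{toy_} names are assumptions, and parameter
names are plain strings rather than hash node pointers); the sketch
ignores argument pre-expansion and the @samp{#} and @samp{##}
operators.

```c
#include <stddef.h>

enum toy_type { TOY_NAME, TOY_PLUS, TOY_MACRO_ARG };

struct toy_token {
    enum toy_type type;
    const char *spelling;   /* for TOY_NAME and TOY_PLUS */
    unsigned arg_no;        /* for TOY_MACRO_ARG: 1-based parameter index */
};

struct toy_macro {
    const struct toy_token *exp;  /* contiguous replacement list */
    unsigned count;               /* number of tokens in exp */
    unsigned paramc;              /* number of parameters */
    const char *const *params;    /* original names, kept for dumping */
};

/* One collected argument: a pointer to its tokens and their count. */
struct toy_arg {
    const struct toy_token *first;
    unsigned count;
};

/* Copy the replacement list of M into OUT, replacing each TOY_MACRO_ARG
   token with the tokens of the matching argument.  Returns the number of
   tokens written, or (size_t) -1 if OUT (capacity MAX) is too small. */
size_t toy_substitute(const struct toy_macro *m, const struct toy_arg *args,
                      struct toy_token *out, size_t max)
{
    size_t n = 0;
    unsigned i, j;

    for (i = 0; i < m->count; i++) {
        const struct toy_token *t = &m->exp[i];
        if (t->type == TOY_MACRO_ARG) {
            /* The stored index selects the argument directly: no name lookup. */
            const struct toy_arg *a = &args[t->arg_no - 1];
            for (j = 0; j < a->count; j++) {
                if (n == max)
                    return (size_t) -1;
                out[n++] = a->first[j];
            }
        } else {
            if (n == max)
                return (size_t) -1;
            out[n++] = *t;
        }
    }
    return n;
}
```

For a macro defined as @samp{#define ADD(a, b) a + b}, the stored list
would hold three tokens: argument 1, @samp{+}, and argument 2.
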
@section Macro expansion overview
The preprocessor maintains a @dfn{context stack}, implemented as a
linked list of @code{cpp_context} structures, which together represent
the macro expansion state at any one time.  The @code{struct
cpp_reader} member variable @code{context} points to the current top
of this stack.  The top normally holds the unexpanded replacement list
of the innermost macro under expansion, except when cpplib is about to
pre-expand an argument, in which case it holds that argument's
unexpanded tokens.

When there are no macros under expansion, cpplib is in @dfn{base
context}.  All contexts other than the base context contain a
contiguous list of tokens delimited by a starting and ending token.
When not in base context, cpplib obtains the next token from the list
of the top context.  If there are no tokens left in the list, it pops
that context off the stack, and subsequent ones if necessary, until an
unexhausted context is found or it returns to base context.  In base
context, cpplib reads tokens directly from the lexer.

5251debfc3dSmrgIf it encounters an identifier that is both a macro and enabled for
5261debfc3dSmrgexpansion, cpplib prepares to push a new context for that macro on the
5271debfc3dSmrgstack by calling the routine @code{enter_macro_context}.  When this
5281debfc3dSmrgroutine returns, the new context will contain the unexpanded tokens of
5291debfc3dSmrgthe replacement list of that macro.  In the case of function-like
5301debfc3dSmrgmacros, @code{enter_macro_context} also replaces any parameters in the
5311debfc3dSmrgreplacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
5321debfc3dSmrgappropriate macro argument.  If the standard requires that the
5331debfc3dSmrgparameter be replaced with its expanded argument, the argument will
5341debfc3dSmrghave been fully macro expanded first.
5351debfc3dSmrg
5361debfc3dSmrg@code{enter_macro_context} also handles special macros like
5371debfc3dSmrg@code{__LINE__}.  Although these macros expand to a single token which
5381debfc3dSmrgcannot contain any further macros, for reasons of token spacing
5391debfc3dSmrg(@pxref{Token Spacing}) and simplicity of implementation, cpplib
5401debfc3dSmrghandles these special macros by pushing a context containing just that
5411debfc3dSmrgone token.
5421debfc3dSmrg
5431debfc3dSmrgThe final thing that @code{enter_macro_context} does before returning
5441debfc3dSmrgis to mark the macro disabled for expansion (except for special macros
5451debfc3dSmrglike @code{__TIME__}).  The macro is re-enabled when its context is
5461debfc3dSmrglater popped from the context stack, as described above.  This strict
5471debfc3dSmrgordering ensures that a macro is disabled whilst its expansion is
5481debfc3dSmrgbeing scanned, but that it is @emph{not} disabled whilst any arguments
5491debfc3dSmrgto it are being expanded.
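
The stack discipline described in this section can be sketched as
follows.  This is an illustration only: plain integers stand in for
tokens, the base context reads from an array rather than from a lexer,
and every @code{sk_} name is invented for the example.

```c
#include <stddef.h>

/* Hypothetical macro record: only the "disabled" flag matters here.  */
struct sk_macro
{
  int disabled;
};

struct sk_context
{
  struct sk_context *prev;  /* Context beneath this one; NULL for the base.  */
  const int *cur, *end;     /* Remaining tokens (ints stand in for tokens).  */
  struct sk_macro *macro;   /* Macro under expansion, or NULL.  */
};

struct sk_reader
{
  struct sk_context base;      /* Base context; its array plays the lexer.  */
  struct sk_context *context;  /* Current top of the context stack.  */
};

/* Push the replacement list [FIRST, FIRST + COUNT) as context C and
   disable MACRO while its expansion is being scanned.  */
static void
sk_push (struct sk_reader *r, struct sk_context *c, struct sk_macro *macro,
         const int *first, size_t count)
{
  c->prev = r->context;
  c->cur = first;
  c->end = first + count;
  c->macro = macro;
  if (macro != NULL)
    macro->disabled = 1;
  r->context = c;
}

/* Return the next token, or -1 once the base input is exhausted.  */
static int
sk_next (struct sk_reader *r)
{
  /* Pop exhausted contexts; a macro is re-enabled only at this point.  */
  while (r->context != &r->base && r->context->cur == r->context->end)
    {
      if (r->context->macro != NULL)
        r->context->macro->disabled = 0;
      r->context = r->context->prev;
    }
  if (r->context->cur == r->context->end)
    return -1;
  return *r->context->cur++;
}
```

In this sketch a macro stays disabled after its last token has been
read, and is re-enabled only when a later call to @code{sk_next} pops
its context.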

@section Scanning the replacement list for macros to expand
The C standard states that, after any parameters have been replaced
with their possibly-expanded arguments, the replacement list is
scanned for nested macros.  Further, any identifiers in the
replacement list that are not expanded during this scan are never
again eligible for expansion in the future, if the reason they were
not expanded is that the macro in question was disabled.

Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion.  Other tokens never have an opportunity to be
re-tested for expansion.  It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan.  This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token
happens to be an opening parenthesis (itself possibly the first token
of an argument).

It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack.  Only when
looking for the @emph{next} token do we pop it off the stack and drop
to a lower context.  This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current
context is still disabled when we are considering the last token of
its replacement list for expansion (or indeed expanding it).  As an
example, which illustrates many of the points above, consider

@smallexample
#define foo(x) bar x
foo(foo) (2)
@end smallexample

@noindent which fully expands to @samp{bar foo (2)}.  During pre-expansion
of the argument, @samp{foo} does not expand even though the macro is
enabled, since it has no following parenthesis [pre-expansion of an
argument only uses tokens from that argument; it cannot take tokens
from whatever follows the macro invocation].  This still leaves the
argument token @samp{foo} eligible for future expansion.  Then, when
re-scanning after argument replacement, the token @samp{foo} is
rejected for expansion, and marked ineligible for future expansion,
since the macro is now disabled.  It is disabled because the
replacement list @samp{bar foo} of the macro is still on the context
stack.

If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro was disabled, it would be subtly wrong.
In the example above, the replacement list of @samp{foo} would be
popped in the process of finding the parenthesis, re-enabling
@samp{foo} and expanding it a second time.
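
The opposite case mentioned earlier in this section, where a
function-like macro name is left alone during argument pre-expansion but
expands during a later scan, can be shown in standard C.  The macro
names @code{APPLY} and @code{TWICE} are invented for this illustration.

```c
/* TWICE is the last (and only) token of APPLY's argument, so during
   argument pre-expansion it is not followed by an opening parenthesis
   and is left unexpanded.  */
#define TWICE(v) ((v) * 2)
#define APPLY(m) m (3)

/* After parameter replacement the list reads "TWICE (3)"; the rescan
   now sees the opening parenthesis and expands TWICE.  */
static int
late_expansion (void)
{
  return APPLY (TWICE);
}
```

Here the rescan of @code{APPLY}'s replacement list finds @samp{TWICE}
followed by @samp{(3)}, so the function returns the value of
@samp{((3) * 2)}.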

@section Looking for a function-like macro's opening parenthesis
Function-like macros only expand when immediately followed by an
opening parenthesis.  To check for one, cpplib needs to temporarily
disable macros and read the next token.  Unfortunately, because of
spacing issues (@pxref{Token Spacing}), there can be fake padding tokens
in between, and if the next real token is not a parenthesis cpplib needs
to be able to back up that one token as well as retain the information
in any intervening padding tokens.

Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which are too hard.
Instead, searching for the parenthesis is handled by a special
function, @code{funlike_invocation_p}, which remembers padding
information as it reads tokens.  If the next real token is not an
opening parenthesis, it backs up that one token, and then pushes an
extra context just containing the padding information if necessary.

@section Marking tokens ineligible for future expansion
As discussed above, cpplib needs a way of marking tokens as
unexpandable.  Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the
flag @code{NO_EXPAND} to the copy.

For efficiency and to simplify memory management by avoiding having to
remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (@pxref{Lexing a line}) using the
function @code{_cpp_temp_token}.  The tokens are then re-used once the
current line of tokens has been read in.

This might sound unsafe.  However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization
is safe.
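
The copy-and-flag step can be sketched in a few lines.  The names are
hypothetical, the flag value is arbitrary, and the caller supplies the
storage that in cpplib would come from the lexer's token run.

```c
/* Hypothetical stand-in for cpplib's NO_EXPAND flag.  */
#define SK_NO_EXPAND 2u

struct sk_token
{
  unsigned flags;
  const char *spelling;
};

/* Copy SRC into DEST and mark only the copy as unexpandable, leaving
   the original (read-only) token untouched.  Returns DEST.  */
static const struct sk_token *
sk_copy_no_expand (struct sk_token *dest, const struct sk_token *src)
{
  *dest = *src;
  dest->flags |= SK_NO_EXPAND;
  return dest;
}
```

Any other holder of a pointer to the original token still sees it
without the flag.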

@node Token Spacing
@unnumbered Token Spacing
@cindex paste avoidance
@cindex spacing
@cindex token spacing

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its preprocessed
output results in an identical token stream.  Without taking special
measures, this might not be the case because of macro substitution.
For example:

@smallexample
#define PLUS +
#define EMPTY
#define f(x) =x=
+PLUS -EMPTY- PLUS+ f(=)
        @expansion{} + + - - + + = = =
@emph{not}
        @expansion{} ++ -- ++ ===
@end smallexample

One solution would be to simply insert a space between all adjacent
tokens.  However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

For now, just notice that when tokens are added to (or removed from, as
shown by the @code{EMPTY} example) the original lexed token stream, we
need to check for accidental token pasting.  We call this @dfn{paste
avoidance}.  Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the @samp{#} and @samp{##} operators.

Look at how the preprocessor gets whitespace output correct
normally.  The @code{cpp_token} structure contains a flags byte, and one
of those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
indicates that the token was preceded by whitespace of some form other
than a newline.  The stand-alone preprocessor can use this flag to
decide whether to insert a space between tokens in the output.

Now consider the result of the following macro expansion:

@smallexample
#define add(x, y, z) x + y +z;
sum = add (1,2, 3);
        @expansion{} sum = 1 + 2 +3;
@end smallexample

The interesting thing here is that the tokens @samp{1} and @samp{2} are
output with a preceding space, and @samp{3} is output without a
preceding space, but when lexed none of these tokens had that property.
Careful consideration reveals that @samp{1} gets its preceding
whitespace from the space preceding @samp{add} in the macro invocation,
@emph{not} from the replacement list.  @samp{2} gets its whitespace from
the space preceding the parameter @samp{y} in the macro replacement list,
and @samp{3} has no preceding space because parameter @samp{z} has none
in the replacement list.

Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions.  So instead of modifying the tokens
above, the preprocessor inserts a special token, which I call a
@dfn{padding token}, into the token stream to indicate that spacing of
the subsequent token is special.  The preprocessor inserts padding
tokens in front of every macro expansion and expanded macro argument.
These point to a @dfn{source token} from which the subsequent real token
should inherit its spacing.  In the above example, the source tokens are
@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
macro replacement list, respectively.

It is quite easy to get multiple padding tokens in a row, for example if
a macro's first replacement token expands straight into another macro.

@smallexample
#define foo bar
#define bar baz
[foo]
        @expansion{} [baz]
@end smallexample

Here, two padding tokens are generated, whose sources are the @samp{foo}
token between the brackets and the @samp{bar} token from @samp{foo}'s
replacement list, respectively.  Clearly the first padding token is the
one to use, so the output code should contain a rule that the first
padding token in a sequence is the one that matters.

But what happens when we leave a macro expansion?  Adjusting the above
example slightly:

@smallexample
#define foo bar
#define bar EMPTY baz
#define EMPTY
[foo] EMPTY;
        @expansion{} [ baz] ;
@end smallexample

As shown, now there should be a space before @samp{baz} and the
semicolon in the output.

The rules we decided on above fail for @samp{baz}: we generate three
padding tokens, one per macro invocation, before the token @samp{baz}.
We would then have it take its spacing from the first of these, which
carries source token @samp{foo} with no leading space.

It is vital that cpplib get spacing correct in these examples since any
of these macro expansions could be stringized, where spacing matters.

So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too.  I made
cpplib insert a padding token with a @code{NULL} source token when
leaving macro expansions, as well as after each replaced argument in a
macro's replacement list.  It also inserts appropriate padding tokens on
either side of tokens created by the @samp{#} and @samp{##} operators.
I expanded the rule so that, if we see a padding token with a
@code{NULL} source token, @emph{and} the source token currently in
effect has no leading space, then we behave as if we have seen no
padding tokens at all.  A quick check shows this rule will then get the
above example correct as well.

Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct.  This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function @code{cpp_avoid_paste} advises whether a space is required
between two consecutive tokens.  To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.
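
The rule for choosing a source token from a run of padding tokens can be
written down compactly.  The following is a sketch of the rule as
described in this chapter, not the stand-alone preprocessor's actual
output code; the structure and the @code{SK_PREV_WHITE} flag are
simplified stand-ins, and paste avoidance is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for cpplib's PREV_WHITE flag.  */
#define SK_PREV_WHITE 1u

struct sk_token
{
  int is_padding;                 /* Nonzero for a padding token.  */
  unsigned flags;                 /* SK_PREV_WHITE or 0, for real tokens.  */
  const struct sk_token *source;  /* Padding only: source token or NULL.  */
};

/* Decide whether the real token TOKS[INDEX] is printed with a leading
   space, given the padding tokens immediately before it.  */
static int
sk_leading_space (const struct sk_token *toks, size_t index)
{
  const struct sk_token *source = NULL;
  size_t i = index;

  /* Find the start of the run of padding tokens before TOKS[INDEX].  */
  while (i > 0 && toks[i - 1].is_padding)
    i--;

  for (; i < index; i++)
    {
      /* The first padding token of a run wins, unless a NULL-source
         padding token arrives while the chosen source has no leading
         space; then start over as if no padding had been seen.  */
      if (source == NULL
          || (!(source->flags & SK_PREV_WHITE) && toks[i].source == NULL))
        source = toks[i].source;
    }

  /* With no source in effect, the token's own flag decides.  */
  if (source == NULL)
    source = &toks[index];
  return (source->flags & SK_PREV_WHITE) != 0;
}
```

Applied to the @samp{[foo] EMPTY;} example, the run of padding tokens
before @samp{baz} ends with a @code{NULL} source, so @samp{baz} falls
back on its own flag and gets a space, while the semicolon inherits its
space from the @samp{EMPTY} token on the source line.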

@node Line Numbering
@unnumbered Line numbering
@cindex line numbers

@section Just which line number anyway?

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

@itemize @bullet
@item
The source line it was lexed on.
@item
The line it is output on.  This can be different to the line it was
lexed on if, for example, there are intervening escaped newlines or
C-style comments.  For example:

@smallexample
foo /* @r{A long
comment} */ bar \
baz
@result{}
foo bar baz
@end smallexample

@item
If the token results from a macro expansion, the line of the macro name,
or possibly the line of the closing parenthesis in the case of
function-like macro expansion.
@end itemize

The @code{cpp_token} structure contains @code{line} and @code{col}
members.  The lexer fills these in with the line and column of the first
character of the token.  Consequently, but maybe unexpectedly, a token
from the replacement list of a macro expansion carries the location of
the token within the @code{#define} directive, because cpplib expands a
macro by returning pointers to the tokens in its replacement list.  The
current implementation of cpplib assigns tokens created from built-in
macros and the @samp{#} and @samp{##} operators the location of the most
recently lexed token.  This is because they are allocated from the
lexer's token runs, and because of the way the diagnostic routines infer
the appropriate location to report.

The diagnostic routines in cpplib display the location of the most
recently @emph{lexed} token, unless they are passed a specific line and
column to report.  For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.

The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion.  All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

To solve these issues, cpplib provides a callback that is generated
whenever it lexes a preprocessing token that starts a new logical line
other than a directive.  It passes this token (which may be a
@code{CPP_EOF} token indicating the end of the translation unit) to the
callback routine, which can then use the line and column of this token
to produce correct output.

@section Representation of line numbers

As mentioned above, cpplib stores with each token the line number that
it was lexed on.  In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

The preprocessor maintains a monotonically increasing line count, which
is incremented at every newline character (and also at the end of any
buffer that does not end in a newline).  Since a line number of zero is
useful to indicate certain special states and conditions, this variable
starts counting from one.

This variable therefore uniquely enumerates each line in the translation
unit.  With some simple infrastructure, it is straightforward to map
from this to the original source file and line number pair, saving space
whenever line number information needs to be saved.  The code that
implements this mapping lies in the files @file{line-map.c} and
@file{line-map.h}.

Command-line macros and assertions are implemented by pushing a buffer
containing the right hand side of an equivalent @code{#define} or
@code{#assert} directive.  Some built-in macros are handled similarly.
Since these are all processed before the first line of the main input
file, that line will typically have an assigned line number closer to
twenty than to one.
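
One possible shape for the simple infrastructure mentioned above is a
sorted table searched by bisection.  The sketch below is an illustration
under that assumption; the structure and function names are invented and
do not reflect the actual contents of @file{line-map.h}.

```c
#include <stddef.h>

/* Hypothetical map entry: translation-unit lines from START_LINE up to
   the next entry's START_LINE belong to FILE, beginning at FILE_LINE.  */
struct sk_line_map
{
  unsigned start_line;  /* First translation-unit line covered.  */
  const char *file;     /* Source file name.  */
  unsigned file_line;   /* Line of FILE corresponding to START_LINE.  */
};

/* Find the entry covering LINE.  MAPS must be sorted by START_LINE,
   hold at least one entry, and MAPS[0].start_line must not exceed
   LINE.  */
static const struct sk_line_map *
sk_lookup (const struct sk_line_map *maps, size_t n, unsigned line)
{
  size_t lo = 0, hi = n;  /* Invariant: maps[lo].start_line <= line.  */

  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (maps[mid].start_line <= line)
        lo = mid;
      else
        hi = mid;
    }
  return &maps[lo];
}

/* Translate a translation-unit line into a line of MAP's source file.  */
static unsigned
sk_source_line (const struct sk_line_map *map, unsigned line)
{
  return map->file_line + (line - map->start_line);
}
```

A token then needs to store only one integer; the file name and the
line within that file are recovered on demand, for instance when a
diagnostic is issued.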

@node Guard Macros
@unnumbered The Multiple-Include Optimization
@cindex guard macros
@cindex controlling macros
@cindex multiple-include optimization

Header files are often of the form

@smallexample
#ifndef FOO
#define FOO
@dots{}
#endif
@end smallexample

@noindent
to prevent the compiler from processing them more than once.  The
preprocessor notices such header files, so that if the header file
appears in a subsequent @code{#include} directive and @code{FOO} is
defined, then the directive is ignored and the preprocessor doesn't
preprocess or even re-open the file a second time.  This is referred to
as the @dfn{multiple-include optimization}.

Under what circumstances is such an optimization valid?  If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process.  Therefore the current implementation imposes
requirements and makes some allowances as follows:

@enumerate
@item
There must be no tokens outside the controlling @code{#if}-@code{#endif}
pair, but whitespace and comments are permitted.

@item
There must be no directives outside the controlling directive pair, but
the @dfn{null directive} (a line containing nothing other than a single
@samp{#} and possibly whitespace) is permitted.

@item
The opening directive must be of the form

@smallexample
#ifndef FOO
@end smallexample

or

@smallexample
#if !defined FOO     [equivalently, #if !defined(FOO)]
@end smallexample

@item
In the second form above, the tokens forming the @code{#if} expression
must have come directly from the source file---no macro expansion must
have been involved.  This is because macro definitions can change, and
tracking whether or not a relevant change has been made is not worth the
implementation cost.

@item
There can be no @code{#else} or @code{#elif} directives at the outer
conditional block level, because they would probably contain something
of interest to a subsequent pass.
@end enumerate

First, when pushing a new file on the buffer stack,
@code{stack_include_file} sets the controlling macro @code{mi_cmacro} to
@code{NULL}, and sets @code{mi_valid} to @code{true}.  This indicates
that the preprocessor has not yet encountered anything that would
invalidate the multiple-include optimization.  As described in the next
few paragraphs, these two variables having these values effectively
indicates top-of-file.

When about to return a token that is not part of a directive,
@code{_cpp_lex_token} sets @code{mi_valid} to @code{false}.  This
enforces the constraint that tokens outside the controlling conditional
block invalidate the optimization.

The @code{do_if}, when appropriate, and @code{do_ifndef} directive
handlers pass the controlling macro to the function
@code{push_conditional}.  cpplib maintains a stack of nested conditional
blocks, and after processing every opening conditional this function
pushes an @code{if_stack} structure onto the stack.  In this structure
it records the controlling macro for the block, provided there is one
and we're at top-of-file (as described above).  If an @code{#elif} or
@code{#else} directive is encountered, the controlling macro for that
block is cleared to @code{NULL}.  Otherwise, it survives until the
@code{#endif} closing the block, upon which @code{do_endif} sets
@code{mi_valid} to @code{true} and stores the controlling macro in
@code{mi_cmacro}.

@code{_cpp_handle_directive} clears @code{mi_valid} when processing any
directive other than an opening conditional and the null directive.
With this, and requiring top-of-file to record a controlling macro, and
no @code{#else} or @code{#elif} for it to survive and be copied to
@code{mi_cmacro} by @code{do_endif}, we have enforced the absence of
directives outside the main conditional block for the optimization to be
on.

Note that whilst we are inside the conditional block, @code{mi_valid} is
likely to be reset to @code{false}, but this does not matter since
the closing @code{#endif} restores it to @code{true} if appropriate.

Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack
at @code{EOF} without returning a token, if the @code{#endif} directive
was not followed by any tokens, @code{mi_valid} is @code{true} and
@code{_cpp_pop_file_buffer} remembers the controlling macro associated
with the file.  Subsequent calls to @code{stack_include_file} result in
no buffer being pushed if the controlling macro is defined, effecting
the optimization.

A quick word on how we handle the

@smallexample
#if !defined FOO
@end smallexample

@noindent
case.  @code{_cpp_parse_expr} and @code{parse_defined} take steps to see
whether the three stages @samp{!}, @samp{defined-expression} and
@samp{end-of-directive} occur in order in a @code{#if} expression.  If
so, they return the guard macro to @code{do_if} in the variable
@code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.
@code{enter_macro_context} sets @code{mi_valid} to @code{false}, so if a
macro was expanded whilst parsing any part of the expression, then the
top-of-file test in @code{push_conditional} fails and the optimization
is turned off.
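
The bookkeeping described in this chapter can be condensed into a small
state machine.  The sketch below follows the prose rather than the
source: the @code{sk_} functions are invented, one per kind of event,
and two details are my own reading of the description, namely that
@code{#endif} first clears @code{mi_valid} like any other non-opening
directive, and that only an outermost block may restore it.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_DEPTH 16  /* Arbitrary nesting limit for this sketch.  */

struct sk_mi_state
{
  int mi_valid;                        /* Nothing has invalidated the file.  */
  const char *mi_cmacro;               /* Controlling macro found so far.  */
  const char *if_stack[SK_MAX_DEPTH];  /* Guard of each open block, or NULL.  */
  size_t depth;
};

/* A new file is pushed: this state means "top of file".  */
static void
sk_start_file (struct sk_mi_state *s)
{
  s->mi_valid = 1;
  s->mi_cmacro = NULL;
  s->depth = 0;
}

/* A token that is not part of a directive is about to be returned.  */
static void
sk_token (struct sk_mi_state *s)
{
  s->mi_valid = 0;
}

/* Opening conditional.  GUARD is FOO for "#ifndef FOO" or
   "#if !defined FOO", and NULL for any other condition.  */
static void
sk_open_conditional (struct sk_mi_state *s, const char *guard)
{
  int top_of_file = s->mi_valid && s->mi_cmacro == NULL;

  assert (s->depth < SK_MAX_DEPTH);
  s->if_stack[s->depth++] = top_of_file ? guard : NULL;
}

/* "#else" or "#elif": the current block can no longer be a guard.  */
static void
sk_else (struct sk_mi_state *s)
{
  assert (s->depth > 0);
  s->mi_valid = 0;
  s->if_stack[s->depth - 1] = NULL;
}

/* "#endif": an outermost block that kept its guard restores validity.  */
static void
sk_endif (struct sk_mi_state *s)
{
  const char *guard;

  assert (s->depth > 0);
  guard = s->if_stack[--s->depth];
  s->mi_valid = 0;
  if (s->depth == 0 && guard != NULL)
    {
      s->mi_valid = 1;
      s->mi_cmacro = guard;
    }
}

/* Any other directive except the null directive.  */
static void
sk_other_directive (struct sk_mi_state *s)
{
  s->mi_valid = 0;
}

/* End of file: the guard to remember for this file, or NULL.  */
static const char *
sk_end_file (const struct sk_mi_state *s)
{
  return s->mi_valid ? s->mi_cmacro : NULL;
}
```

Feeding the events of a conventional guarded header (opening
conditional, @code{#define}, tokens, @code{#endif}) leaves the guard
recorded at end of file, whereas a token before the opening conditional
or after the @code{#endif} leaves nothing to remember.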
9931debfc3dSmrg
9941debfc3dSmrg@node Files
9951debfc3dSmrg@unnumbered File Handling
9961debfc3dSmrg@cindex files
9971debfc3dSmrg
9981debfc3dSmrgFairly obviously, the file handling code of cpplib resides in the file
9991debfc3dSmrg@file{files.c}.  It takes care of the details of file searching,
10001debfc3dSmrgopening, reading and caching, for both the main source file and all the
10011debfc3dSmrgheaders it recursively includes.
10021debfc3dSmrg
10031debfc3dSmrgThe basic strategy is to minimize the number of system calls.  On many
10041debfc3dSmrgsystems, the basic @code{open ()} and @code{fstat ()} system calls can
10051debfc3dSmrgbe quite expensive.  For every @code{#include}-d file, we need to try
10061debfc3dSmrgall the directories in the search path until we find a match.  Some
10071debfc3dSmrgprojects, such as glibc, pass twenty or thirty include paths on the
10081debfc3dSmrgcommand line, so this can rapidly become time consuming.
10091debfc3dSmrg
10101debfc3dSmrgFor a header file we have not encountered before we have little choice
10111debfc3dSmrgbut to do this.  However, it is often the case that the same headers are
10121debfc3dSmrgrepeatedly included, and in these cases we try to avoid repeating the
10131debfc3dSmrgfilesystem queries whilst searching for the correct file.
10141debfc3dSmrg
10151debfc3dSmrgFor each file we try to open, we store the constructed path in a splay
10161debfc3dSmrgtree.  This path first undergoes simplification by the function
10171debfc3dSmrg@code{_cpp_simplify_pathname}.  For example,
10181debfc3dSmrg@file{/usr/include/bits/../foo.h} is simplified to
10191debfc3dSmrg@file{/usr/include/foo.h} before we enter it in the splay tree and try
10201debfc3dSmrgto @code{open ()} the file.  CPP will then find subsequent uses of
10211debfc3dSmrg@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
10221debfc3dSmrgsave system calls.

Further, it is likely the file contents have also been cached, saving a
@code{read ()} system call.  We don't bother caching the contents of
header files that are re-inclusion protected, and whose re-inclusion
macro is defined when we leave the header file for the first time.  If
the host supports it, we try to map suitably large files into memory,
rather than reading them in directly.

The include paths are internally stored in a null-terminated
singly-linked list, starting with the @code{"header.h"} directory search
chain, which then links into the @code{<header.h>} directory chain.

Files included with the @code{<foo.h>} syntax start the lookup directly
in the second half of this chain.  However, files included with the
@code{"foo.h"} syntax start at the beginning of the chain, but with one
extra directory prepended.  This is the directory of the current file,
the one containing the @code{#include} directive.  Prepending this
directory on a per-file basis is handled by the function
@code{search_from}.

Note that a header included with a directory component, such as
@code{#include "mydir/foo.h"} and opened as
@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
the basename @samp{foo.h} as the current directory.
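
The chained search order described above can be sketched in C as
follows.  All names here (@code{struct search_dir},
@code{search_chain}, @code{find_include}) are hypothetical and do not
reflect cpplib's actual data structures.  To keep the sketch
independent of the file system, a null-terminated table of known paths
stands in for the @code{open ()} attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One directory in the include search path.  */
struct search_dir
{
  const char *name;
  const struct search_dir *next;        /* NULL ends the chain.  */
};

/* Stand-in for the open () attempt: FILES is a NULL-terminated list
   of the paths that "exist".  */
static int
file_exists (const char *const *files, const char *path)
{
  assert (files != NULL && path != NULL);
  for (; *files != NULL; files++)
    if (strcmp (*files, path) == 0)
      return 1;
  return 0;
}

/* Try NAME in each directory from START onwards.  On success store
   the full path in BUF and return 1; otherwise return 0.  */
static int
search_chain (const struct search_dir *start, const char *name,
              const char *const *files, char *buf, size_t size)
{
  const struct search_dir *d;

  for (d = start; d != NULL; d = d->next)
    {
      int n = snprintf (buf, size, "%s/%s", d->name, name);
      if (n > 0 && (size_t) n < size && file_exists (files, buf))
        return 1;
    }
  return 0;
}

/* Look up NAME.  The <NAME> form starts at BRACKET.  The "NAME" form
   tries CUR_DIR first and then the whole chain from QUOTE, which is
   expected to link into BRACKET.  */
static int
find_include (const struct search_dir *quote,
              const struct search_dir *bracket,
              const char *cur_dir, const char *name, int angled,
              const char *const *files, char *buf, size_t size)
{
  struct search_dir here;

  if (angled)
    return search_chain (bracket, name, files, buf, size);

  /* Prepend the current file's directory for this lookup only.  */
  here.name = cur_dir;
  here.next = quote;
  return search_chain (&here, name, files, buf, size);
}
```

Because the extra directory lives in a temporary list node, the shared
chains are never modified, and each including file can supply its own
starting directory.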

Enough information is stored in the splay tree that CPP can immediately
tell whether it can skip the header file because of the multiple include
optimization, whether the file didn't exist or couldn't be opened for
some reason, or whether the header was flagged not to be re-used, as it
is with the obsolete @code{#import} directive.
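
The decision that such a cached record allows can be sketched in C as
follows.  The record layout and all names are hypothetical, a linear
scan stands in for the splay tree lookup, and the sketch assumes that a
record is created the first time a path is tried, so a once-only record
always refers to a header that has already been included.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What to do with an #include once the cache has been consulted.  */
enum include_action
{
  INC_SEARCH,   /* Path not seen before: query the file system.  */
  INC_FAIL,     /* An earlier attempt to open the file failed.  */
  INC_SKIP,     /* Guarded or once-only header: nothing to re-read.  */
  INC_REUSE     /* Known file that must be processed again.  */
};

/* Hypothetical per-path record; not cpplib's actual structure.  */
struct cache_entry
{
  const char *path;     /* Simplified path, used as the key.  */
  int missing;          /* Nonzero if the file could not be opened.  */
  int once_only;        /* Nonzero if flagged not to be re-used.  */
  const char *guard;    /* Re-inclusion macro, or NULL if none.  */
};

/* Linear scan standing in for the splay tree lookup.  */
static const struct cache_entry *
cache_lookup (const struct cache_entry *entries, size_t count,
              const char *path)
{
  size_t i;

  assert (path != NULL);
  for (i = 0; i < count; i++)
    if (strcmp (entries[i].path, path) == 0)
      return &entries[i];
  return NULL;
}

/* Decide how to handle an #include from its cached record alone.
   GUARD_DEFINED says whether ENTRY's re-inclusion macro is currently
   defined; it is ignored when the entry has no such macro.  */
static enum include_action
classify_include (const struct cache_entry *entry, int guard_defined)
{
  if (entry == NULL)
    return INC_SEARCH;
  if (entry->missing)
    return INC_FAIL;
  if (entry->once_only || (entry->guard != NULL && guard_defined))
    return INC_SKIP;
  return INC_REUSE;
}
```

Only the @code{INC_SEARCH} case needs to touch the file system; the
other three outcomes are decided from the cached record alone.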

For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
CPP offers the ability to treat various include file names as aliases
for the real header files with shorter names.  The map from one to the
other is found in a special file called @samp{header.gcc}, stored in the
command line (or system) include directories to which the mapping
applies.  This may be higher up the directory tree than the full path to
the file minus the base name.

@node Concept Index
@unnumbered Concept Index
@printindex cp

@bye