This is cppinternals.info, produced by makeinfo version 6.5 from
cppinternals.texi.

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* Cpplib: (cppinternals).      Cpplib internals.
END-INFO-DIR-ENTRY

This file documents the internals of the GNU C Preprocessor.

   Copyright (C) 2000-2020 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the entire resulting derived work is distributed under the terms of
a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions.


File: cppinternals.info,  Node: Top,  Next: Conventions,  Up: (dir)

The GNU C Preprocessor Internals
********************************

* Menu:

* Conventions::
* Lexer::
* Hash Nodes::
* Macro Expansion::
* Token Spacing::
* Line Numbering::
* Guard Macros::
* Files::
* Concept Index::

1 Cpplib--the GNU C Preprocessor
********************************

The GNU C preprocessor is implemented as a library, "cpplib", so it can
be easily shared between a stand-alone preprocessor, and a preprocessor
integrated with the C, C++ and Objective-C front ends.  It is also
available for use by other programs, though this is not recommended as
its exposed interface has not yet reached a point of reasonable
stability.

   The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

   This brief manual documents the internals of cpplib, and explains
some of the tricky issues.  It is intended that, along with the comments
in the source code, a reasonably competent C programmer should be able
to figure out what the code is doing, and why things have been
implemented the way they have.

* Menu:

* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.


File: cppinternals.info,  Node: Conventions,  Next: Lexer,  Prev: Top,  Up: Top

Conventions
***********

cpplib has two interfaces--one is exposed internally only, and the other
is for both internal and external use.

   The convention is that functions and types that are exposed to
multiple files internally are prefixed with '_cpp_', and are to be found
in the file 'internal.h'.  Functions and types exposed to external
clients are in 'cpplib.h', and prefixed with 'cpp_'.  For historical
reasons this is no longer quite true, but we should strive to stick to
it.

   We are striving to reduce the information exposed in 'cpplib.h' to
the bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.


File: cppinternals.info,  Node: Lexer,  Next: Hash Nodes,  Prev: Conventions,  Up: Top

The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.c'.  It is a hand-coded lexer,
and not implemented as a state machine.  It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

   It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
'cpp_spell_token' and 'cpp_token_len'.  These functions are useful when
generating diagnostics, and for emitting the preprocessed output.

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines.  In its current form the code is quite complicated, with
read ahead characters and such-like, since it strives to not step back
in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

   The job of '_cpp_lex_direct' is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor role to play in memory management
of lexed lines.  I discuss these issues in a separate section (*note
Lexing a line::).

   The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

   The lexer does not consider whitespace to be a token in its own
right.  If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags.  Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token.  This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.

   The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line.  This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it.  Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set.  The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.

   New lines are treated specially; exactly how the lexer handles them
is context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive.  In
a directive a 'CPP_EOF' token never means end-of-file.  Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.

   The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringized.  The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2.  It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

   This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.
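
   The rule that a new line inside macro arguments counts as whitespace
can be observed from ordinary C code.  The following is a minimal
sketch; the macro 'STR' and the function 'newline_is_whitespace' are
illustrative names, not part of cpplib.

```c
#include <string.h>

/* STR is a hypothetical helper macro that stringizes its argument.  */
#define STR(x) #x

/* The argument spans two lines.  The new line between `a' and `b'
   is treated as whitespace, so the resulting string is "a b".  */
static const char *const spliced = STR(a
                                       b);

/* Return nonzero if the stringized argument is exactly "a b".  */
int newline_is_whitespace(void)
{
  return strcmp(spliced, "a b") == 0;
}
```

   Had the new line been dropped instead, the string would have been
"ab", which is why the lexer records it in the 'PREV_WHITE' flag.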

   The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator.  This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.
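
   As an illustration of the four new line indicators, here is a small
stand-alone sketch.  It is not cpplib's 'handle_newline'; the function
names 'newline_length' and 'count_newlines' are assumptions made for
this example.

```c
#include <stddef.h>

/* If S points at a new line indicator, return its length in
   characters (1 or 2); otherwise return 0.  */
size_t newline_length(const char *s)
{
  if (s[0] != '\n' && s[0] != '\r')
    return 0;
  /* "\r\n" and "\n\r" form one indicator; "\n\n" and "\r\r" are two.  */
  if ((s[1] == '\n' || s[1] == '\r') && s[1] != s[0])
    return 2;
  return 1;
}

/* Count the new line indicators in the NUL-terminated string S.  */
size_t count_newlines(const char *s)
{
  size_t count = 0;

  while (*s)
    {
      size_t len = newline_length(s);
      if (len)
        {
          count++;
          s += len;
        }
      else
        s++;
    }
  return count;
}
```

   Under these assumptions a file that mixes Unix, MS-DOS and Macintosh
line endings yields the same line count as its normalized form.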

   We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline.  It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars.  Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.

   Handling newlines like this is made simpler by doing it in one place
only.  The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminates a comment.  Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.
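
   The following valid, if perverse, C function shows escaped newlines
inside an identifier and inside the '+=' token.  It is only a sketch;
the function name 'spliced_tokens' is made up for this example.  Note
that each continuation line must begin in column 0, since any leading
whitespace would split the token.

```c
/* The identifier `value' and the `+=' operator are each split by a
   backslash-newline.  After line splicing the statement reads
   `value += 2;' and lexes as usual.  */
int spliced_tokens(void)
{
  int value = 40;
  val\
ue +\
= 2;
  return value;
}
```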

   So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline.  If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.

   Similarly code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.
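
   The idea behind such a helper can be sketched as follows.  This is a
simplified stand-in working on a NUL-terminated string, not the code in
'lex.c': the names 'escaped_newline_length', 'next_effective_char' and
'starts_with_plus_eq' are assumptions, and the sketch does not accept
whitespace between the backslash and the newline.

```c
#include <stddef.h>

/* Return the length of an escaped newline starting at S, or 0 if
   there is none.  The backslash may be a literal backslash or the
   three-character trigraph spelling.  */
static size_t escaped_newline_length(const char *s)
{
  size_t n;

  if (s[0] == '\\')
    n = 1;
  else if (s[0] == '?' && s[1] == '?' && s[2] == '/')
    n = 3;
  else
    return 0;

  if (s[n] != '\n' && s[n] != '\r')
    return 0;
  /* Treat "\r\n" and "\n\r" as a single newline.  */
  if ((s[n + 1] == '\n' || s[n + 1] == '\r') && s[n + 1] != s[n])
    return n + 2;
  return n + 1;
}

/* Skip any run of escaped newlines at *SP, then return the next
   character and advance *SP past it.  Returns 0 at end of string.  */
int next_effective_char(const char **sp)
{
  size_t len;

  while ((len = escaped_newline_length(*sp)) != 0)
    *sp += len;
  return **sp ? (unsigned char) *(*sp)++ : 0;
}

/* Return nonzero if S begins with a `+=' token, allowing escaped
   newlines between the two characters.  */
int starts_with_plus_eq(const char *s)
{
  if (next_effective_char(&s) != '+')
    return 0;
  return next_effective_char(&s) == '=';
}
```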

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

   Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic.  However, if they appear in a macro
expansion we don't want to complain with each use of the macro.  It is
therefore best to catch them during the lexing stage, in
'parse_identifier'.  In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state.  For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro.  Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate.  Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.

   Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, everything from the '<' as far as
the nearest '>' character should be lexed as a single header-name token.
Note that we don't allow the terminators of header names to be escaped;
the first '"' or '>' terminates the header name.

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error.  Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.

   Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'.  The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

   Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

   The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do _not_ lex a line of tokens at
once; if we did that 'parse_identifier' would not have state flags
available to warn about invalid identifiers (*note Invalid
identifiers::).

   In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

   Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

   Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run.  Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.
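
   The chaining scheme can be sketched with a few lines of C.  The
structures and functions below ('struct tokenrun', 'lexer_init',
'next_slot' and so on) are simplified, hypothetical versions written for
this example; cpplib's own structures carry more fields.

```c
#include <stdlib.h>

struct token { int type; };

struct tokenrun
{
  struct tokenrun *next;   /* Next run in the chain, or NULL.  */
  struct token *base;      /* First slot of this run.  */
  struct token *limit;     /* One past the last slot.  */
};

struct lexer
{
  struct tokenrun *cur_run;  /* Run currently being filled.  */
  struct token *cur_token;   /* Next free slot in cur_run.  */
};

/* Allocate a run holding COUNT tokens; return NULL on failure.  */
static struct tokenrun *new_run(size_t count)
{
  struct tokenrun *run = malloc(sizeof *run);

  if (!run)
    return NULL;
  run->base = malloc(count * sizeof *run->base);
  if (!run->base)
    {
      free(run);
      return NULL;
    }
  run->next = NULL;
  run->limit = run->base + count;
  return run;
}

/* Give L a first run of COUNT tokens.  Return 0 on success.  */
int lexer_init(struct lexer *l, size_t count)
{
  l->cur_run = new_run(count);
  if (!l->cur_run)
    return -1;
  l->cur_token = l->cur_run->base;
  return 0;
}

/* Return a slot for the next token, or NULL if out of memory.  On
   overflow a new run of COUNT tokens is chained on; realloc is never
   used, so slots handed out earlier do not move.  */
struct token *next_slot(struct lexer *l, size_t count)
{
  if (l->cur_token == l->cur_run->limit)
    {
      if (!l->cur_run->next)
        l->cur_run->next = new_run(count);
      if (!l->cur_run->next)
        return NULL;
      l->cur_run = l->cur_run->next;
      l->cur_token = l->cur_run->base;
    }
  return l->cur_token++;
}
```

   Because an existing run is reused when one is already chained, going
back to the first run at the start of a new line costs no allocation in
this sketch.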

   The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'.  So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like '__LINE__', and the '#' and '##' operators for stringizing
and token pasting.  I handled this by allocating space for these tokens
from the lexer's token run chain.  This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

   Lexing into a line of tokens solves some of the token memory
management issues, but not all.  The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line.  So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro.  In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

   The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made.  '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.


File: cppinternals.info,  Node: Hash Nodes,  Next: Macro Expansion,  Prev: Lexer,  Up: Top

Hash Nodes
**********

When cpplib encounters an "identifier", it generates a hash code for it
and stores it in the hash table.  By "identifier" we mean tokens with
type 'CPP_NAME'; this includes identifiers in the usual C sense, as well
as keywords, directive names, macro names and so on.  For example, all
of 'pragma', 'int', 'foo' and '__GNUC__' are identifiers and hashed when
lexed.

   Each node in the hash table contains various information about the
identifier it represents, for example its length and type.  At any one
time, each identifier falls into exactly one of three categories:

   * Macros

     These have been declared to be macros, either on the command line
     or with '#define'.  A few, such as '__TIME__', are built-ins entered
     in the hash table during initialization.  The hash node for a
     normal macro points to a structure with more information about the
     macro, such as whether it is function-like, how many arguments it
     takes, and its expansion.  The hash node for a built-in macro is
     flagged as special, and instead contains an enum indicating which
     of the various built-in macros it is.

   * Assertions

     Assertions are in a separate namespace to macros.  To enforce this,
     cpp actually prepends a '#' character to the assertion's name
     before hashing it and entering it in the hash table.  An
     assertion's node points to a chain of answers to that assertion.

   * Void

     Everything else falls into this category--an identifier that is not
     currently a macro, or a macro that has since been undefined with
     '#undef'.

     When preprocessing C++, this category also includes the named
     operators, such as 'xor'.  In expressions these behave like the
     operators they represent, but in contexts where the spelling of a
     token matters they are spelt differently.  This spelling
     distinction is relevant when they are operands of the stringizing
     and pasting macro operators '#' and '##'.  Named operator hash
     nodes are flagged, both to catch the spelling distinction and to
     prevent them from being defined as macros.

   All occurrences of the same identifier share the same hash node.
Since each identifier token, after lexing, contains a pointer to its
hash node, this is used to provide rapid lookup of various information.
For example, when parsing a '#define' statement, CPP flags each
argument's identifier hash node with the index of that argument.  This
makes duplicated argument checking an O(1) operation for each argument.
Similarly, for each identifier in the macro's expansion, lookup to see
if it is an argument, and which argument it is, is also an O(1)
operation.  Further, each directive name, such as 'endif', has an
associated directive enum stored in its hash node, so that directive
lookup is also O(1).
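
   The duplicate-argument check can be sketched as below.  This is a
toy model, not cpplib's hash table: 'struct hashnode', 'lookup',
'save_parameters' and 'clear_parameters' are hypothetical names, and the
fixed-size table is assumed never to fill up.

```c
#include <string.h>

#define TABLE_SIZE 64   /* Ample for a sketch.  */

struct hashnode
{
  const char *name;   /* Identifier spelling; NULL if slot is empty.  */
  int arg_index;      /* 1-based parameter index, or 0 if none.  */
};

static struct hashnode table[TABLE_SIZE];

/* Return the unique node for NAME, entering it if necessary.  NAME
   must outlive the table.  Assumes the table never becomes full.  */
struct hashnode *lookup(const char *name)
{
  unsigned h = 5381;

  for (const char *p = name; *p; p++)
    h = h * 33 + (unsigned char) *p;
  for (unsigned i = h % TABLE_SIZE; ; i = (i + 1) % TABLE_SIZE)
    {
      if (!table[i].name)
        table[i].name = name;
      if (strcmp(table[i].name, name) == 0)
        return &table[i];
    }
}

/* Mark PARAMS[0..N-1] as macro parameters.  Return the 1-based
   position of the first duplicate, or 0 if all are distinct.  The
   duplicate test is a single field read per parameter.  */
int save_parameters(struct hashnode **params, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (params[i]->arg_index)
        return i + 1;
      params[i]->arg_index = i + 1;
    }
  return 0;
}

/* Remove the marks once the definition has been processed.  */
void clear_parameters(struct hashnode **params, int n)
{
  for (int i = 0; i < n; i++)
    params[i]->arg_index = 0;
}
```

   While the marks are in place, testing whether an identifier in the
replacement list is a parameter is likewise one read of 'arg_index'.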


File: cppinternals.info,  Node: Macro Expansion,  Next: Token Spacing,  Prev: Hash Nodes,  Up: Top

Macro Expansion Algorithm
*************************

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to optimize
the preprocessor's expansion algorithm wrong in quite subtle ways.

   I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this section,
let alone the code!  If you don't have a clear mental picture of how
things like nested macro expansion, stringizing and token pasting are
supposed to work, damage to your sanity can quickly result.

Internal representation of macros
=================================

The preprocessor stores macro expansions in tokenized form.  This saves
repeated lexing passes during expansion, at the cost of a small increase
in memory consumption on average.  The tokens are stored contiguously in
memory, so a pointer to the first one and a token count is all you need
to get the replacement list of a macro.

   If the macro is a function-like macro the preprocessor also stores
its parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier.  Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type 'CPP_MACRO_ARG'.  Each such token holds the index
of the parameter it represents in the parameter list, which allows rapid
replacement of parameters with their arguments during expansion.
Despite this optimization it is still necessary to store the original
parameters to the macro, both for dumping with e.g., '-dD', and to warn
about non-trivial macro redefinitions when the parameter names have
changed.
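
   A toy model of this representation is sketched below.  The types and
the function 'substitute' are assumptions made for illustration, and
for brevity each argument is limited to a single token, which is not
true of real macro arguments.

```c
#include <stddef.h>

enum toktype { TOK_TEXT, TOK_MACRO_ARG };

struct token
{
  enum toktype type;
  const char *spelling;   /* For TOK_TEXT.  */
  unsigned arg_no;        /* For TOK_MACRO_ARG: 0-based index.  */
};

struct macro
{
  const struct token *exp;   /* Replacement list, contiguous.  */
  size_t count;              /* Number of tokens in it.  */
  unsigned paramc;           /* Number of parameters.  */
};

/* Copy the spellings of M's replacement list to OUT, replacing each
   parameter token with the single-token argument ARGS[arg_no].
   Return the number of spellings written, or 0 if OUT, which has
   room for CAP entries, is too small.  */
size_t substitute(const struct macro *m, const char *const *args,
                  const char **out, size_t cap)
{
  if (m->count > cap)
    return 0;
  for (size_t i = 0; i < m->count; i++)
    {
      const struct token *t = &m->exp[i];
      /* The stored index makes parameter replacement a plain array
         access; no name comparison is needed.  */
      out[i] = t->type == TOK_MACRO_ARG ? args[t->arg_no] : t->spelling;
    }
  return m->count;
}
```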

Macro expansion overview
========================

The preprocessor maintains a "context stack", implemented as a linked
list of 'cpp_context' structures, which together represent the macro
expansion state at any one time.  The 'struct cpp_reader' member
variable 'context' points to the current top of this stack.  The top
normally holds the unexpanded replacement list of the innermost macro
under expansion, except when cpplib is about to pre-expand an argument,
in which case it holds that argument's unexpanded tokens.

   When there are no macros under expansion, cpplib is in "base
context".  All contexts other than the base context contain a contiguous
list of tokens delimited by a starting and ending token.  When not in
base context, cpplib obtains the next token from the list of the top
context.  If there are no tokens left in the list, it pops that context
off the stack, and subsequent ones if necessary, until an unexhausted
context is found or it returns to base context.  In base context, cpplib
reads tokens directly from the lexer.
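
   The token-fetching loop described above might look roughly like the
following sketch.  Tokens are modelled as nonzero integers and the
lexer as a zero-terminated array; 'struct context', 'struct reader',
'push_context' and 'get_token' are simplified, hypothetical stand-ins
for the real 'cpp_context' machinery.

```c
#include <stddef.h>

struct context
{
  struct context *prev;     /* Context below this one; NULL for base.  */
  const int *first, *last;  /* Remaining tokens are [first, last).  */
};

struct reader
{
  struct context base;      /* Base context; its token list is unused.  */
  struct context *context;  /* Top of the context stack.  */
  const int *input;         /* Stand-in for the lexer: 0-terminated.  */
};

/* Push C, holding the N tokens at TOKS, on to R's context stack.  */
void push_context(struct reader *r, struct context *c,
                  const int *toks, size_t n)
{
  c->prev = r->context;
  c->first = toks;
  c->last = toks + n;
  r->context = c;
}

/* Return the next token, or 0 once the input is exhausted.  */
int get_token(struct reader *r)
{
  for (;;)
    {
      struct context *c = r->context;

      if (c->prev == NULL)
        /* Base context: read directly from the "lexer".  */
        return *r->input ? *r->input++ : 0;
      if (c->first < c->last)
        return *c->first++;
      /* Exhausted: pop it and try the context underneath.  */
      r->context = c->prev;
    }
}
```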
4911debfc3dSmrg
4921debfc3dSmrg   If it encounters an identifier that is both a macro and enabled for
4931debfc3dSmrgexpansion, cpplib prepares to push a new context for that macro on the
494a2dc1f3fSmrgstack by calling the routine 'enter_macro_context'.  When this routine
4951debfc3dSmrgreturns, the new context will contain the unexpanded tokens of the
4961debfc3dSmrgreplacement list of that macro.  In the case of function-like macros,
497a2dc1f3fSmrg'enter_macro_context' also replaces any parameters in the replacement
498a2dc1f3fSmrglist, stored as 'CPP_MACRO_ARG' tokens, with the appropriate macro
4991debfc3dSmrgargument.  If the standard requires that the parameter be replaced with
5001debfc3dSmrgits expanded argument, the argument will have been fully macro expanded
5011debfc3dSmrgfirst.
5021debfc3dSmrg
503a2dc1f3fSmrg   'enter_macro_context' also handles special macros like '__LINE__'.
5041debfc3dSmrgAlthough these macros expand to a single token which cannot contain any
505a2dc1f3fSmrgfurther macros, for reasons of token spacing (*note Token Spacing::) and
506a2dc1f3fSmrgsimplicity of implementation, cpplib handles these special macros by
507a2dc1f3fSmrgpushing a context containing just that one token.
5081debfc3dSmrg
509a2dc1f3fSmrg   The final thing that 'enter_macro_context' does before returning is
510a2dc1f3fSmrgto mark the macro disabled for expansion (except for special macros like
511a2dc1f3fSmrg'__TIME__').  The macro is re-enabled when its context is later popped
512a2dc1f3fSmrgfrom the context stack, as described above.  This strict ordering
513a2dc1f3fSmrgensures that a macro is disabled whilst its expansion is being scanned,
514a2dc1f3fSmrgbut that it is _not_ disabled whilst any arguments to it are being
515a2dc1f3fSmrgexpanded.
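
   As a rough sketch of this ordering, under the assumption of greatly
simplified and hypothetical types ('struct macro', 'struct context',
'struct reader') and helper names ('enter_macro', 'pop_context') that
are not cpplib's own, disabling can be tied to pushing a context and
re-enabling to popping it:

```c
#include <stdbool.h>
#include <stddef.h>

struct macro { bool disabled; };   /* hypothetical; not cpplib's macro type */

struct context {
  struct context *prev;
  struct macro *macro;   /* macro whose expansion this context holds, or NULL */
};

struct reader { struct context *context; };

/* Push context C for macro M.  Any argument collection and
   pre-expansion is assumed to have happened before this call, so
   disabling the macro is the very last step.  */
void
enter_macro (struct reader *r, struct context *c, struct macro *m)
{
  c->prev = r->context;
  c->macro = m;
  r->context = c;
  m->disabled = true;
}

/* Pop the top context, re-enabling the macro it belonged to.  */
void
pop_context (struct reader *r)
{
  struct context *c = r->context;
  if (c->macro != NULL)
    c->macro->disabled = false;
  r->context = c->prev;
}
```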
5161debfc3dSmrg
5171debfc3dSmrgScanning the replacement list for macros to expand
5181debfc3dSmrg==================================================
5191debfc3dSmrg
520a2dc1f3fSmrgThe C standard states that, after any parameters have been replaced with
521a2dc1f3fSmrgtheir possibly-expanded arguments, the replacement list is scanned for
522a2dc1f3fSmrgnested macros.  Further, any identifiers in the replacement list that
523a2dc1f3fSmrgare not expanded during this scan are never again eligible for expansion
524a2dc1f3fSmrgin the future, if the reason they were not expanded is that the macro in
525a2dc1f3fSmrgquestion was disabled.
5261debfc3dSmrg
5271debfc3dSmrg   Clearly this latter condition can only apply to tokens resulting from
5281debfc3dSmrgargument pre-expansion.  Other tokens never have an opportunity to be
5291debfc3dSmrgre-tested for expansion.  It is possible for identifiers that are
5301debfc3dSmrgfunction-like macros to not expand initially but to expand during a
5311debfc3dSmrglater scan.  This occurs when the identifier is the last token of an
5321debfc3dSmrgargument (and therefore originally followed by a comma or a closing
5331debfc3dSmrgparenthesis in its macro's argument list), and when it replaces its
5341debfc3dSmrgparameter in the macro's replacement list, the subsequent token happens
5351debfc3dSmrgto be an opening parenthesis (itself possibly the first token of an
5361debfc3dSmrgargument).
5371debfc3dSmrg
5381debfc3dSmrg   It is important to note that when cpplib reads the last token of a
5391debfc3dSmrggiven context, that context still remains on the stack.  Only when
5401debfc3dSmrglooking for the _next_ token do we pop it off the stack and drop to a
5411debfc3dSmrglower context.  This makes backing up by one token easy, but more
5421debfc3dSmrgimportantly ensures that the macro corresponding to the current context
5431debfc3dSmrgis still disabled when we are considering the last token of its
544a2dc1f3fSmrgreplacement list for expansion (or indeed expanding it).  As an example,
545a2dc1f3fSmrgwhich illustrates many of the points above, consider
5461debfc3dSmrg
5471debfc3dSmrg     #define foo(x) bar x
5481debfc3dSmrg     foo(foo) (2)
5491debfc3dSmrg
550a2dc1f3fSmrgwhich fully expands to 'bar foo (2)'.  During pre-expansion of the
551a2dc1f3fSmrgargument, 'foo' does not expand even though the macro is enabled, since
5521debfc3dSmrgit has no following parenthesis [pre-expansion of an argument only uses
5531debfc3dSmrgtokens from that argument; it cannot take tokens from whatever follows
554a2dc1f3fSmrgthe macro invocation].  This still leaves the argument token 'foo'
5551debfc3dSmrgeligible for future expansion.  Then, when re-scanning after argument
556a2dc1f3fSmrgreplacement, the token 'foo' is rejected for expansion, and marked
557a2dc1f3fSmrgineligible for future expansion, since the macro is now disabled.  It is
558a2dc1f3fSmrgdisabled because the replacement list 'bar foo' of the macro is still on
559a2dc1f3fSmrgthe context stack.
5601debfc3dSmrg
5611debfc3dSmrg   If instead the algorithm looked for an opening parenthesis first and
5621debfc3dSmrgthen tested whether the macro were disabled it would be subtly wrong.
563a2dc1f3fSmrgIn the example above, the replacement list of 'foo' would be popped in
564a2dc1f3fSmrgthe process of finding the parenthesis, re-enabling 'foo' and expanding
5651debfc3dSmrgit a second time.
5661debfc3dSmrg
5671debfc3dSmrgLooking for a function-like macro's opening parenthesis
5681debfc3dSmrg=======================================================
5691debfc3dSmrg
Function-like macros only expand when immediately followed by an opening
parenthesis.  To check for this, cpplib needs to temporarily disable
macros and read the next token.  Unfortunately, because of spacing
issues (*note Token Spacing::), there can be fake padding tokens
in-between, and if the next real token is not a parenthesis cpplib needs
to be able to back up that one token as well as retain the information
in any intervening padding tokens.
5771debfc3dSmrg
5781debfc3dSmrg   Backing up more than one token when macros are involved is not
5791debfc3dSmrgpermitted by cpplib, because in general it might involve issues like
5801debfc3dSmrgrestoring popped contexts onto the context stack, which are too hard.
581a2dc1f3fSmrgInstead, searching for the parenthesis is handled by a special function,
582a2dc1f3fSmrg'funlike_invocation_p', which remembers padding information as it reads
583a2dc1f3fSmrgtokens.  If the next real token is not an opening parenthesis, it backs
584a2dc1f3fSmrgup that one token, and then pushes an extra context just containing the
585a2dc1f3fSmrgpadding information if necessary.
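
   The search can be sketched as below.  This is only an illustration
of the idea: the token kinds, 'struct stream' and the function
'followed_by_paren' are hypothetical, and the flag 'pending_padding'
stands in for the extra padding-only context that would be pushed.

```c
#include <stdbool.h>
#include <stddef.h>

enum kind { PADDING, OPEN_PAREN, OTHER };

struct stream {
  const enum kind *toks;
  size_t pos, len;
  bool pending_padding;   /* stand-in for the extra padding-only context */
};

/* Return true, consuming the parenthesis, if the next real token is
   an opening parenthesis.  Otherwise back up over the one real token
   that was read and remember whether any padding was skipped.  */
bool
followed_by_paren (struct stream *s)
{
  bool saw_padding = false;

  while (s->pos < s->len)
    {
      enum kind k = s->toks[s->pos++];   /* read one token */

      if (k == PADDING)
        {
          saw_padding = true;   /* remember the padding information */
          continue;
        }
      if (k == OPEN_PAREN)
        return true;

      s->pos--;   /* not an invocation: back up exactly one token */
      break;
    }

  s->pending_padding = saw_padding;
  return false;
}
```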
5861debfc3dSmrg
5871debfc3dSmrgMarking tokens ineligible for future expansion
5881debfc3dSmrg==============================================
5891debfc3dSmrg
5901debfc3dSmrgAs discussed above, cpplib needs a way of marking tokens as
5911debfc3dSmrgunexpandable.  Since the tokens cpplib handles are read-only once they
5921debfc3dSmrghave been lexed, it instead makes a copy of the token and adds the flag
593a2dc1f3fSmrg'NO_EXPAND' to the copy.
5941debfc3dSmrg
5951debfc3dSmrg   For efficiency and to simplify memory management by avoiding having
5961debfc3dSmrgto remember to free these tokens, they are allocated as temporary tokens
5971debfc3dSmrgfrom the lexer's current token run (*note Lexing a line::) using the
598a2dc1f3fSmrgfunction '_cpp_temp_token'.  The tokens are then re-used once the
5991debfc3dSmrgcurrent line of tokens has been read in.
6001debfc3dSmrg
   This might sound unsafe.  However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization is
safe.
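
   The copy-and-flag step might look roughly like the following.  It
is a sketch under stated assumptions: 'struct token', 'struct pool'
and 'mark_unexpandable' are hypothetical names, the fixed-size pool
merely stands in for the lexer's token run, and the numeric value given
to 'NO_EXPAND' here is illustrative only.

```c
#include <stddef.h>

#define NO_EXPAND 0x04   /* illustrative value; not necessarily cpplib's */

struct token { unsigned char flags; const char *name; };

/* Small fixed pool standing in for the lexer's current token run.  */
struct pool { struct token slots[8]; size_t used; };

/* Return a temporary copy of ORIG carrying the NO_EXPAND flag, leaving
   ORIG itself untouched.  Returns NULL if the pool is full.  */
const struct token *
mark_unexpandable (struct pool *p, const struct token *orig)
{
  if (p->used == sizeof p->slots / sizeof p->slots[0])
    return NULL;

  struct token *copy = &p->slots[p->used++];
  *copy = *orig;
  copy->flags |= NO_EXPAND;
  return copy;
}
```

   Resetting 'used' to zero would correspond to re-using the temporary
tokens once the current line has been read in.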
6061debfc3dSmrg
6071debfc3dSmrg
6081debfc3dSmrgFile: cppinternals.info,  Node: Token Spacing,  Next: Line Numbering,  Prev: Macro Expansion,  Up: Top
6091debfc3dSmrg
6101debfc3dSmrgToken Spacing
6111debfc3dSmrg*************
6121debfc3dSmrg
6131debfc3dSmrgFirst, consider an issue that only concerns the stand-alone
6141debfc3dSmrgpreprocessor: there needs to be a guarantee that re-reading its
6151debfc3dSmrgpreprocessed output results in an identical token stream.  Without
6161debfc3dSmrgtaking special measures, this might not be the case because of macro
6171debfc3dSmrgsubstitution.  For example:
6181debfc3dSmrg
6191debfc3dSmrg     #define PLUS +
6201debfc3dSmrg     #define EMPTY
6211debfc3dSmrg     #define f(x) =x=
6221debfc3dSmrg     +PLUS -EMPTY- PLUS+ f(=)
6231debfc3dSmrg             ==> + + - - + + = = =
6241debfc3dSmrg     _not_
6251debfc3dSmrg             ==> ++ -- ++ ===
6261debfc3dSmrg
6271debfc3dSmrg   One solution would be to simply insert a space between all adjacent
6281debfc3dSmrgtokens.  However, we would like to keep space insertion to a minimum,
6291debfc3dSmrgboth for aesthetic reasons and because it causes problems for people who
6301debfc3dSmrgstill try to abuse the preprocessor for things like Fortran source and
6311debfc3dSmrgMakefiles.
6321debfc3dSmrg
633a2dc1f3fSmrg   For now, just notice that when tokens are added (or removed, as shown
634a2dc1f3fSmrgby the 'EMPTY' example) from the original lexed token stream, we need to
635a2dc1f3fSmrgcheck for accidental token pasting.  We call this "paste avoidance".
636a2dc1f3fSmrgToken addition and removal can only occur because of macro expansion,
637a2dc1f3fSmrgbut accidental pasting can occur in many places: both before and after
638a2dc1f3fSmrgeach macro replacement, each argument replacement, and additionally each
639a2dc1f3fSmrgtoken created by the '#' and '##' operators.
6401debfc3dSmrg
641a2dc1f3fSmrg   Look at how the preprocessor gets whitespace output correct normally.
642a2dc1f3fSmrgThe 'cpp_token' structure contains a flags byte, and one of those flags
643a2dc1f3fSmrgis 'PREV_WHITE'.  This is flagged by the lexer, and indicates that the
644a2dc1f3fSmrgtoken was preceded by whitespace of some form other than a new line.
645a2dc1f3fSmrgThe stand-alone preprocessor can use this flag to decide whether to
646a2dc1f3fSmrginsert a space between tokens in the output.
6471debfc3dSmrg
6481debfc3dSmrg   Now consider the result of the following macro expansion:
6491debfc3dSmrg
6501debfc3dSmrg     #define add(x, y, z) x + y +z;
6511debfc3dSmrg     sum = add (1,2, 3);
6521debfc3dSmrg             ==> sum = 1 + 2 +3;
6531debfc3dSmrg
   The interesting thing here is that the tokens '1' and '2' are output
with a preceding space, and '3' is output without a preceding space, but
when lexed none of these tokens had that property.  Careful
consideration reveals that '1' gets its preceding whitespace from the
space preceding 'add' in the macro invocation, _not_ from the
replacement list.  '2' gets its whitespace from the space preceding the
parameter 'y' in the macro replacement list, and '3' has no preceding
space because parameter 'z' has none in the replacement list.
6621debfc3dSmrg
6631debfc3dSmrg   Once lexed, tokens are effectively fixed and cannot be altered, since
6641debfc3dSmrgpointers to them might be held in many places, in particular by
6651debfc3dSmrgin-progress macro expansions.  So instead of modifying the two tokens
666a2dc1f3fSmrgabove, the preprocessor inserts a special token, which I call a "padding
667a2dc1f3fSmrgtoken", into the token stream to indicate that spacing of the subsequent
668a2dc1f3fSmrgtoken is special.  The preprocessor inserts padding tokens in front of
669a2dc1f3fSmrgevery macro expansion and expanded macro argument.  These point to a
670a2dc1f3fSmrg"source token" from which the subsequent real token should inherit its
671a2dc1f3fSmrgspacing.  In the above example, the source tokens are 'add' in the macro
672a2dc1f3fSmrginvocation, and 'y' and 'z' in the macro replacement list, respectively.
6731debfc3dSmrg
674a2dc1f3fSmrg   It is quite easy to get multiple padding tokens in a row, for example
675a2dc1f3fSmrgif a macro's first replacement token expands straight into another
676a2dc1f3fSmrgmacro.
6771debfc3dSmrg
6781debfc3dSmrg     #define foo bar
6791debfc3dSmrg     #define bar baz
6801debfc3dSmrg     [foo]
6811debfc3dSmrg             ==> [baz]
6821debfc3dSmrg
   Here, two padding tokens are generated, whose sources are the 'foo'
token between the brackets and the 'bar' token from foo's replacement
list, respectively.  Clearly the first padding token is the one to use,
so the output code should contain a rule that the first padding token in
a sequence is the one that matters.
6881debfc3dSmrg
   But what happens when we leave a macro expansion?  Adjusting the
above example slightly:
6911debfc3dSmrg
6921debfc3dSmrg     #define foo bar
6931debfc3dSmrg     #define bar EMPTY baz
6941debfc3dSmrg     #define EMPTY
6951debfc3dSmrg     [foo] EMPTY;
6961debfc3dSmrg             ==> [ baz] ;
6971debfc3dSmrg
698a2dc1f3fSmrg   As shown, now there should be a space before 'baz' and the semicolon
6991debfc3dSmrgin the output.
7001debfc3dSmrg
701a2dc1f3fSmrg   The rules we decided above fail for 'baz': we generate three padding
702a2dc1f3fSmrgtokens, one per macro invocation, before the token 'baz'.  We would then
703a2dc1f3fSmrghave it take its spacing from the first of these, which carries source
704a2dc1f3fSmrgtoken 'foo' with no leading space.
7051debfc3dSmrg
7061debfc3dSmrg   It is vital that cpplib get spacing correct in these examples since
7071debfc3dSmrgany of these macro expansions could be stringized, where spacing
7081debfc3dSmrgmatters.
7091debfc3dSmrg
   So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too.  I made
cpplib insert a padding token with a 'NULL' source token when leaving
macro expansions, as well as after each replaced argument in a macro's
replacement list.  It also inserts appropriate padding tokens on either
side of tokens created by the '#' and '##' operators.  I expanded the
rule so that, if we see a padding token with a 'NULL' source token,
_and_ the source token remembered from the preceding padding has no
leading space, then we behave as if we have seen no padding tokens at
all.  A quick check shows this rule will then get the above example
correct as well.
7201debfc3dSmrg
7211debfc3dSmrg   Now a relationship with paste avoidance is apparent: we have to be
7221debfc3dSmrgcareful about paste avoidance in exactly the same locations we have
7231debfc3dSmrgpadding tokens in order to get white space correct.  This makes
7241debfc3dSmrgimplementation of paste avoidance easy: wherever the stand-alone
7251debfc3dSmrgpreprocessor is fixing up spacing because of padding tokens, and it
7261debfc3dSmrgturns out that no space is needed, it has to take the extra step to
7271debfc3dSmrgcheck that a space is not needed after all to avoid an accidental paste.
728a2dc1f3fSmrgThe function 'cpp_avoid_paste' advises whether a space is required
7291debfc3dSmrgbetween two consecutive tokens.  To avoid excessive spacing, it tries
7301debfc3dSmrghard to only require a space if one is likely to be necessary, but for
7311debfc3dSmrgreasons of efficiency it is slightly conservative and might recommend a
7321debfc3dSmrgspace where one is not strictly needed.
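
   To make the spacing rule concrete, here is a small sketch of an
output loop that applies it.  The types and names ('struct token',
'print_tokens', 'would_paste') are hypothetical, the flag value is
illustrative, and 'would_paste' is a deliberately crude stand-in for
'cpp_avoid_paste' that only stops two identical one-character operators
from merging.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PREV_WHITE 0x01   /* illustrative flag value */

struct token {
  bool is_padding;
  unsigned char flags;
  const char *text;             /* spelling; unused for padding tokens */
  const struct token *source;   /* padding only: source token, or NULL */
};

/* Crude stand-in for cpp_avoid_paste (an assumption, not its logic).  */
static bool
would_paste (const struct token *a, const struct token *b)
{
  return strlen (a->text) == 1 && strlen (b->text) == 1
         && a->text[0] == b->text[0]
         && strchr ("+-=&|<>", a->text[0]) != NULL;
}

/* Spell TOKS[0..N-1] into OUT (capacity CAP), inserting spaces
   according to the padding rule.  Returns OUT.  */
char *
print_tokens (const struct token *toks, size_t n, char *out, size_t cap)
{
  const struct token *source = NULL, *prev = NULL;
  bool avoid_paste = false;
  size_t len = 0;

  if (cap == 0)
    return out;
  out[0] = '\0';

  for (size_t i = 0; i < n; i++)
    {
      const struct token *t = &toks[i];
      bool space;

      if (t->is_padding)
        {
          avoid_paste = true;
          /* The first padding token wins, unless its source has no
             leading space and this one has a NULL source; in that
             case behave as if no padding had been seen.  */
          if (source == NULL
              || (!(source->flags & PREV_WHITE) && t->source == NULL))
            source = t->source;
          continue;
        }

      if (avoid_paste)
        {
          if (source == NULL)
            source = t;   /* no usable padding: use the token's own flag */
          space = (source->flags & PREV_WHITE) != 0
                  || (prev != NULL && would_paste (prev, t));
        }
      else
        space = (t->flags & PREV_WHITE) != 0;

      if (len + space + strlen (t->text) + 1 > cap)
        break;   /* out of room: stop rather than overflow */
      if (space)
        out[len++] = ' ';
      strcpy (out + len, t->text);
      len += strlen (t->text);

      avoid_paste = false;
      source = NULL;
      prev = t;
    }
  return out;
}
```

   Fed the token and padding sequence of the '[foo] EMPTY;' example,
this loop places a space before 'baz' and before the semicolon, and none
before the closing bracket.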
7331debfc3dSmrg
7341debfc3dSmrg
7351debfc3dSmrgFile: cppinternals.info,  Node: Line Numbering,  Next: Guard Macros,  Prev: Token Spacing,  Up: Top
7361debfc3dSmrg
7371debfc3dSmrgLine numbering
7381debfc3dSmrg**************
7391debfc3dSmrg
7401debfc3dSmrgJust which line number anyway?
7411debfc3dSmrg==============================
7421debfc3dSmrg
7431debfc3dSmrgThere are three reasonable requirements a cpplib client might have for
7441debfc3dSmrgthe line number of a token passed to it:
7451debfc3dSmrg
7461debfc3dSmrg   * The source line it was lexed on.
7471debfc3dSmrg   * The line it is output on.  This can be different to the line it was
7481debfc3dSmrg     lexed on if, for example, there are intervening escaped newlines or
7491debfc3dSmrg     C-style comments.  For example:
7501debfc3dSmrg
7511debfc3dSmrg          foo /* A long
7521debfc3dSmrg          comment */ bar \
7531debfc3dSmrg          baz
7541debfc3dSmrg          =>
7551debfc3dSmrg          foo bar baz
7561debfc3dSmrg
7571debfc3dSmrg   * If the token results from a macro expansion, the line of the macro
7581debfc3dSmrg     name, or possibly the line of the closing parenthesis in the case
7591debfc3dSmrg     of function-like macro expansion.
7601debfc3dSmrg
761a2dc1f3fSmrg   The 'cpp_token' structure contains 'line' and 'col' members.  The
7621debfc3dSmrglexer fills these in with the line and column of the first character of
7631debfc3dSmrgthe token.  Consequently, but maybe unexpectedly, a token from the
7641debfc3dSmrgreplacement list of a macro expansion carries the location of the token
765a2dc1f3fSmrgwithin the '#define' directive, because cpplib expands a macro by
7661debfc3dSmrgreturning pointers to the tokens in its replacement list.  The current
767a2dc1f3fSmrgimplementation of cpplib assigns tokens created from built-in macros and
768a2dc1f3fSmrgthe '#' and '##' operators the location of the most recently lexed
token.  This is because they are allocated from the lexer's token
7701debfc3dSmrgruns, and because of the way the diagnostic routines infer the
7711debfc3dSmrgappropriate location to report.
7721debfc3dSmrg
7731debfc3dSmrg   The diagnostic routines in cpplib display the location of the most
7741debfc3dSmrgrecently _lexed_ token, unless they are passed a specific line and
7751debfc3dSmrgcolumn to report.  For diagnostics regarding tokens that arise from
7761debfc3dSmrgmacro expansions, it might also be helpful for the user to see the
7771debfc3dSmrgoriginal location in the macro definition that the token came from.
7781debfc3dSmrgSince that is exactly the information each token carries, such an
7791debfc3dSmrgenhancement could be made relatively easily in future.
7801debfc3dSmrg
7811debfc3dSmrg   The stand-alone preprocessor faces a similar problem when determining
7821debfc3dSmrgthe correct line to output the token on: the position attached to a
7831debfc3dSmrgtoken is fairly useless if the token came from a macro expansion.  All
7841debfc3dSmrgtokens on a logical line should be output on its first physical line, so
7851debfc3dSmrgthe token's reported location is also wrong if it is part of a physical
7861debfc3dSmrgline other than the first.
7871debfc3dSmrg
7881debfc3dSmrg   To solve these issues, cpplib provides a callback that is generated
7891debfc3dSmrgwhenever it lexes a preprocessing token that starts a new logical line
790a2dc1f3fSmrgother than a directive.  It passes this token (which may be a 'CPP_EOF'
7911debfc3dSmrgtoken indicating the end of the translation unit) to the callback
792a2dc1f3fSmrgroutine, which can then use the line and column of this token to produce
793a2dc1f3fSmrgcorrect output.
7941debfc3dSmrg
7951debfc3dSmrgRepresentation of line numbers
7961debfc3dSmrg==============================
7971debfc3dSmrg
7981debfc3dSmrgAs mentioned above, cpplib stores with each token the line number that
7991debfc3dSmrgit was lexed on.  In fact, this number is not the number of the line in
8001debfc3dSmrgthe source file, but instead bears more resemblance to the number of the
8011debfc3dSmrgline in the translation unit.
8021debfc3dSmrg
   The preprocessor maintains a monotonically increasing line count, which
8041debfc3dSmrgis incremented at every new line character (and also at the end of any
8051debfc3dSmrgbuffer that does not end in a new line).  Since a line number of zero is
8061debfc3dSmrguseful to indicate certain special states and conditions, this variable
8071debfc3dSmrgstarts counting from one.
8081debfc3dSmrg
   This variable therefore uniquely enumerates each line in the
translation unit.  With some simple infrastructure, it is
straightforward to map from this to the original source file and line
number pair, saving space whenever line number information needs to be
saved.  The code that implements this mapping lies in the files
'line-map.c' and 'line-map.h'.
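
   One way such a mapping could work is sketched below.  It is an
assumption-laden illustration, not the contents of 'line-map.h': the
structure 'struct line_map_entry' and the functions 'lookup_line' and
'source_line' are hypothetical.  Each entry records the first
translation-unit line it covers and the source line that corresponds to
it; entries are kept sorted so a binary search finds the right one.

```c
#include <stddef.h>

struct line_map_entry {     /* hypothetical layout */
  const char *file;         /* source file this entry describes */
  unsigned int from_line;   /* first translation-unit line covered */
  unsigned int to_line;     /* source line corresponding to from_line */
};

/* Find the entry covering translation-unit line LINE in MAPS, which
   holds N entries sorted by from_line.  Returns NULL if LINE precedes
   the first entry.  */
const struct line_map_entry *
lookup_line (const struct line_map_entry *maps, size_t n, unsigned int line)
{
  if (n == 0 || line < maps[0].from_line)
    return NULL;

  size_t lo = 0, hi = n;   /* invariant: maps[lo].from_line <= line */
  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (maps[mid].from_line <= line)
        lo = mid;
      else
        hi = mid;
    }
  return &maps[lo];
}

/* Convert translation-unit line LINE to a line number within M's file.  */
unsigned int
source_line (const struct line_map_entry *m, unsigned int line)
{
  return m->to_line + (line - m->from_line);
}
```

   Storing one entry per file change, rather than a file name with
every token, is what saves the space mentioned above.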
8151debfc3dSmrg
8161debfc3dSmrg   Command-line macros and assertions are implemented by pushing a
817a2dc1f3fSmrgbuffer containing the right hand side of an equivalent '#define' or
818a2dc1f3fSmrg'#assert' directive.  Some built-in macros are handled similarly.  Since
819a2dc1f3fSmrgthese are all processed before the first line of the main input file, it
820a2dc1f3fSmrgwill typically have an assigned line closer to twenty than to one.
8211debfc3dSmrg
8221debfc3dSmrg
8231debfc3dSmrgFile: cppinternals.info,  Node: Guard Macros,  Next: Files,  Prev: Line Numbering,  Up: Top
8241debfc3dSmrg
8251debfc3dSmrgThe Multiple-Include Optimization
8261debfc3dSmrg*********************************
8271debfc3dSmrg
8281debfc3dSmrgHeader files are often of the form
8291debfc3dSmrg
8301debfc3dSmrg     #ifndef FOO
8311debfc3dSmrg     #define FOO
8321debfc3dSmrg     ...
8331debfc3dSmrg     #endif
8341debfc3dSmrg
to prevent the compiler from processing them more than once.  The
preprocessor notices such header files, so that if the header file
appears in a subsequent '#include' directive and 'FOO' is defined, then
the directive is ignored and the preprocessor doesn't preprocess or even
re-open the file a second time.  This is referred to as the "multiple
include optimization".
8411debfc3dSmrg
8421debfc3dSmrg   Under what circumstances is such an optimization valid?  If the file
8431debfc3dSmrgwere included a second time, it can only be optimized away if that
8441debfc3dSmrginclusion would result in no tokens to return, and no relevant
8451debfc3dSmrgdirectives to process.  Therefore the current implementation imposes
8461debfc3dSmrgrequirements and makes some allowances as follows:
8471debfc3dSmrg
848a2dc1f3fSmrg  1. There must be no tokens outside the controlling '#if'-'#endif'
8491debfc3dSmrg     pair, but whitespace and comments are permitted.
8501debfc3dSmrg
851a2dc1f3fSmrg  2. There must be no directives outside the controlling directive pair,
852a2dc1f3fSmrg     but the "null directive" (a line containing nothing other than a
853a2dc1f3fSmrg     single '#' and possibly whitespace) is permitted.
8541debfc3dSmrg
8551debfc3dSmrg  3. The opening directive must be of the form
8561debfc3dSmrg
8571debfc3dSmrg          #ifndef FOO
8581debfc3dSmrg
8591debfc3dSmrg     or
8601debfc3dSmrg
8611debfc3dSmrg          #if !defined FOO     [equivalently, #if !defined(FOO)]
8621debfc3dSmrg
863a2dc1f3fSmrg  4. In the second form above, the tokens forming the '#if' expression
8641debfc3dSmrg     must have come directly from the source file--no macro expansion
8651debfc3dSmrg     must have been involved.  This is because macro definitions can
866a2dc1f3fSmrg     change, and tracking whether or not a relevant change has been made
867a2dc1f3fSmrg     is not worth the implementation cost.
8681debfc3dSmrg
869a2dc1f3fSmrg  5. There can be no '#else' or '#elif' directives at the outer
8701debfc3dSmrg     conditional block level, because they would probably contain
8711debfc3dSmrg     something of interest to a subsequent pass.
8721debfc3dSmrg
8731debfc3dSmrg   First, when pushing a new file on the buffer stack,
874a2dc1f3fSmrg'_stack_include_file' sets the controlling macro 'mi_cmacro' to 'NULL',
875a2dc1f3fSmrgand sets 'mi_valid' to 'true'.  This indicates that the preprocessor has
876a2dc1f3fSmrgnot yet encountered anything that would invalidate the multiple-include
877a2dc1f3fSmrgoptimization.  As described in the next few paragraphs, these two
878a2dc1f3fSmrgvariables having these values effectively indicates top-of-file.
8791debfc3dSmrg
8801debfc3dSmrg   When about to return a token that is not part of a directive,
881a2dc1f3fSmrg'_cpp_lex_token' sets 'mi_valid' to 'false'.  This enforces the
8821debfc3dSmrgconstraint that tokens outside the controlling conditional block
8831debfc3dSmrginvalidate the optimization.
8841debfc3dSmrg
885a2dc1f3fSmrg   The 'do_if', when appropriate, and 'do_ifndef' directive handlers
886a2dc1f3fSmrgpass the controlling macro to the function 'push_conditional'.  cpplib
8871debfc3dSmrgmaintains a stack of nested conditional blocks, and after processing
888a2dc1f3fSmrgevery opening conditional this function pushes an 'if_stack' structure
8891debfc3dSmrgonto the stack.  In this structure it records the controlling macro for
8901debfc3dSmrgthe block, provided there is one and we're at top-of-file (as described
891a2dc1f3fSmrgabove).  If an '#elif' or '#else' directive is encountered, the
892a2dc1f3fSmrgcontrolling macro for that block is cleared to 'NULL'.  Otherwise, it
893a2dc1f3fSmrgsurvives until the '#endif' closing the block, upon which 'do_endif'
sets 'mi_valid' to 'true' and stores the controlling macro in 'mi_cmacro'.
8951debfc3dSmrg
896a2dc1f3fSmrg   '_cpp_handle_directive' clears 'mi_valid' when processing any
8971debfc3dSmrgdirective other than an opening conditional and the null directive.
8981debfc3dSmrgWith this, and requiring top-of-file to record a controlling macro, and
899a2dc1f3fSmrgno '#else' or '#elif' for it to survive and be copied to 'mi_cmacro' by
900a2dc1f3fSmrg'do_endif', we have enforced the absence of directives outside the main
9011debfc3dSmrgconditional block for the optimization to be on.
9021debfc3dSmrg
903a2dc1f3fSmrg   Note that whilst we are inside the conditional block, 'mi_valid' is
904a2dc1f3fSmrglikely to be reset to 'false', but this does not matter since the
905a2dc1f3fSmrgclosing '#endif' restores it to 'true' if appropriate.
9061debfc3dSmrg
907a2dc1f3fSmrg   Finally, since '_cpp_lex_direct' pops the file off the buffer stack
908a2dc1f3fSmrgat 'EOF' without returning a token, if the '#endif' directive was not
909a2dc1f3fSmrgfollowed by any tokens, 'mi_valid' is 'true' and '_cpp_pop_file_buffer'
9101debfc3dSmrgremembers the controlling macro associated with the file.  Subsequent
911a2dc1f3fSmrgcalls to 'stack_include_file' result in no buffer being pushed if the
9121debfc3dSmrgcontrolling macro is defined, effecting the optimization.
9131debfc3dSmrg
9141debfc3dSmrg   A quick word on how we handle the
9151debfc3dSmrg
9161debfc3dSmrg     #if !defined FOO
9171debfc3dSmrg
918a2dc1f3fSmrgcase.  '_cpp_parse_expr' and 'parse_defined' take steps to see whether
919a2dc1f3fSmrgthe three stages '!', 'defined-expression' and 'end-of-directive' occur
920a2dc1f3fSmrgin order in a '#if' expression.  If so, they return the guard macro to
921a2dc1f3fSmrg'do_if' in the variable 'mi_ind_cmacro', and otherwise set it to 'NULL'.
'enter_macro_context' sets 'mi_valid' to 'false', so if a macro was
923a2dc1f3fSmrgexpanded whilst parsing any part of the expression, then the top-of-file
924a2dc1f3fSmrgtest in 'push_conditional' fails and the optimization is turned off.
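
   The bookkeeping described in this node can be modelled as a small
state machine.  The sketch below is only an approximation written for
illustration: it keeps the names 'mi_valid' and 'mi_cmacro' from the
text, but 'struct mi_state' and the 'mi_*' functions are hypothetical,
macros are represented by plain strings, and one function is called per
event seen in the file.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 32

struct mi_state {
  bool mi_valid;
  const char *mi_cmacro;
  const char *if_stack[MAX_DEPTH];   /* controlling macro per open block */
  size_t depth;
};

void
mi_start_file (struct mi_state *s)
{
  s->mi_valid = true;   /* these two values together mean top-of-file */
  s->mi_cmacro = NULL;
  s->depth = 0;
}

/* A token outside any directive.  */
void mi_token (struct mi_state *s) { s->mi_valid = false; }

/* Any directive other than an opening conditional or null directive.  */
void mi_other_directive (struct mi_state *s) { s->mi_valid = false; }

/* An opening conditional.  MACRO is the guard for '#ifndef MACRO' or
   '#if !defined MACRO', and NULL for any other condition.  */
bool
mi_open_conditional (struct mi_state *s, const char *macro)
{
  if (s->depth == MAX_DEPTH)
    return false;
  bool top_of_file = s->mi_valid && s->mi_cmacro == NULL;
  s->if_stack[s->depth++] = top_of_file ? macro : NULL;
  return true;
}

void
mi_else_or_elif (struct mi_state *s)
{
  s->mi_valid = false;
  if (s->depth > 0)
    s->if_stack[s->depth - 1] = NULL;   /* the guard does not survive */
}

void
mi_endif (struct mi_state *s)
{
  s->mi_valid = false;   /* '#endif' is not an opening conditional */
  if (s->depth == 0)
    return;              /* unbalanced '#endif': ignored in this sketch */
  const char *macro = s->if_stack[--s->depth];
  if (macro != NULL)
    {
      s->mi_valid = true;
      s->mi_cmacro = macro;
    }
}

/* At end of file: the guard macro to remember, or NULL if none.  */
const char *
mi_end_file (const struct mi_state *s)
{
  return s->mi_valid ? s->mi_cmacro : NULL;
}
```

   A file consisting of '#ifndef FOO', '#define FOO', some tokens and
'#endif' ends with 'FOO' remembered; a stray token before the '#ifndef'
or after the '#endif', or an '#else' at the outer level, leaves nothing
remembered.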
9251debfc3dSmrg
9261debfc3dSmrg
9271debfc3dSmrgFile: cppinternals.info,  Node: Files,  Next: Concept Index,  Prev: Guard Macros,  Up: Top
9281debfc3dSmrg
9291debfc3dSmrgFile Handling
9301debfc3dSmrg*************
9311debfc3dSmrg
9321debfc3dSmrgFairly obviously, the file handling code of cpplib resides in the file
933a2dc1f3fSmrg'files.c'.  It takes care of the details of file searching, opening,
9341debfc3dSmrgreading and caching, for both the main source file and all the headers
9351debfc3dSmrgit recursively includes.
9361debfc3dSmrg
9371debfc3dSmrg   The basic strategy is to minimize the number of system calls.  On
938a2dc1f3fSmrgmany systems, the basic 'open ()' and 'fstat ()' system calls can be
939a2dc1f3fSmrgquite expensive.  For every '#include'-d file, we need to try all the
9401debfc3dSmrgdirectories in the search path until we find a match.  Some projects,
9411debfc3dSmrgsuch as glibc, pass twenty or thirty include paths on the command line,
9421debfc3dSmrgso this can rapidly become time consuming.
9431debfc3dSmrg
9441debfc3dSmrg   For a header file we have not encountered before we have little
9451debfc3dSmrgchoice but to do this.  However, it is often the case that the same
9461debfc3dSmrgheaders are repeatedly included, and in these cases we try to avoid
9471debfc3dSmrgrepeating the filesystem queries whilst searching for the correct file.
9481debfc3dSmrg
9491debfc3dSmrg   For each file we try to open, we store the constructed path in a
9501debfc3dSmrgsplay tree.  This path first undergoes simplification by the function
951a2dc1f3fSmrg'_cpp_simplify_pathname'.  For example, '/usr/include/bits/../foo.h' is
952a2dc1f3fSmrgsimplified to '/usr/include/foo.h' before we enter it in the splay tree
953a2dc1f3fSmrgand try to 'open ()' the file.  CPP will then find subsequent uses of
954a2dc1f3fSmrg'foo.h', even as '/usr/include/foo.h', in the splay tree and save system
955a2dc1f3fSmrgcalls.
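
   For illustration, a purely textual simplification of this kind
could be written as below.  This is a sketch, not the code of
'_cpp_simplify_pathname': the function 'simplify_path' is hypothetical,
it assumes '/' as the only separator, and it ignores the fact that
cancelling 'name/..' can be wrong when 'name' is a symbolic link.

```c
#include <stddef.h>
#include <string.h>

/* Simplify PATH in place: drop "." components and repeated slashes,
   and cancel "name/.." pairs.  Leading ".." components of a relative
   path are kept.  Returns PATH.  */
char *
simplify_path (char *path)
{
  if (path[0] == '\0')
    return path;

  char *start = path + (path[0] == '/');   /* just past a leading '/' */
  int absolute = (start != path);
  char *out = start;     /* end of the simplified output so far */
  char *limit = start;   /* output before this point is never removed */
  const char *in = start;

  while (*in != '\0')
    {
      if (*in == '/')
        {
          in++;   /* skip separators, including repeated ones */
          continue;
        }
      size_t n = strcspn (in, "/");   /* length of this component */
      int dotdot = (n == 2 && in[0] == '.' && in[1] == '.');

      if (n == 1 && in[0] == '.')
        ;   /* "." names the current directory: drop it */
      else if (dotdot && out > limit)
        {
          /* Cancel the previous component and its separator.  */
          while (out > limit && out[-1] != '/')
            out--;
          if (out > limit)
            out--;
        }
      else if (dotdot && absolute)
        ;   /* ".." at the root stays at the root */
      else
        {
          if (out > start)
            *out++ = '/';
          memmove (out, in, n);   /* regions may overlap */
          out += n;
          if (dotdot)
            limit = out;   /* a kept leading ".." must not be cancelled */
        }
      in += n;
    }

  if (out == path)
    *out++ = '.';   /* everything cancelled: the current directory */
  *out = '\0';
  return path;
}
```

   Using the simplified spelling as the lookup key is what lets two
different spellings of the same header share one cache entry.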
9561debfc3dSmrg
957a2dc1f3fSmrg   Further, it is likely the file contents have also been cached, saving
958a2dc1f3fSmrga 'read ()' system call.  We don't bother caching the contents of header
959a2dc1f3fSmrgfiles that are re-inclusion protected, and whose re-inclusion macro is
960a2dc1f3fSmrgdefined when we leave the header file for the first time.  If the host
961a2dc1f3fSmrgsupports it, we try to map suitably large files into memory, rather than
962a2dc1f3fSmrgreading them in directly.
9631debfc3dSmrg
9641debfc3dSmrg   The include paths are internally stored on a null-terminated
965a2dc1f3fSmrgsingly-linked list, starting with the '"header.h"' directory search
966a2dc1f3fSmrgchain, which then links into the '<header.h>' directory chain.
9671debfc3dSmrg
968a2dc1f3fSmrg   Files included with the '<foo.h>' syntax start the lookup directly in
969a2dc1f3fSmrgthe second half of this chain.  However, files included with the
970a2dc1f3fSmrg'"foo.h"' syntax start at the beginning of the chain, but with one extra
directory prepended.  This is the directory of the current file, the one
972a2dc1f3fSmrgcontaining the '#include' directive.  Prepending this directory on a
973a2dc1f3fSmrgper-file basis is handled by the function 'search_from'.
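
   The choice of starting point can be sketched as follows.  The
structure 'struct search_dir' and the function 'search_start' are
hypothetical names used for illustration, and the node for the current
file's directory is assumed to be supplied by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* One directory on the null-terminated, singly-linked search chain.  */
struct search_dir {
  const char *name;
  struct search_dir *next;
};

/* Return the first directory to search.  QUOTE_CHAIN is the head of
   the whole list and eventually links into BRACKET_CHAIN.  For a
   quoted include, CURRENT_DIR (the directory of the including file)
   is linked in front of the quote chain.  */
struct search_dir *
search_start (struct search_dir *quote_chain,
              struct search_dir *bracket_chain,
              struct search_dir *current_dir, bool angle_brackets)
{
  if (angle_brackets)
    return bracket_chain;   /* start in the second half of the chain */

  current_dir->next = quote_chain;
  return current_dir;
}
```

   Because the extra node is separate from the shared chain, each file
can prepend its own directory without disturbing the list used by
others.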
9741debfc3dSmrg
9751debfc3dSmrg   Note that a header included with a directory component, such as
976a2dc1f3fSmrg'#include "mydir/foo.h"' and opened as '/usr/local/include/mydir/foo.h',
977a2dc1f3fSmrgwill have the complete path minus the basename 'foo.h' as the current
978a2dc1f3fSmrgdirectory.
9791debfc3dSmrg
9801debfc3dSmrg   Enough information is stored in the splay tree that CPP can
9811debfc3dSmrgimmediately tell whether it can skip the header file because of the
982a2dc1f3fSmrgmultiple include optimization, whether the file didn't exist or couldn't
983a2dc1f3fSmrgbe opened for some reason, or whether the header was flagged not to be
984a2dc1f3fSmrgre-used, as it is with the obsolete '#import' directive.
9851debfc3dSmrg
9861debfc3dSmrg   For the benefit of MS-DOS filesystems with an 8.3 filename
9871debfc3dSmrglimitation, CPP offers the ability to treat various include file names
9881debfc3dSmrgas aliases for the real header files with shorter names.  The map from
989a2dc1f3fSmrgone to the other is found in a special file called 'header.gcc', stored
9901debfc3dSmrgin the command line (or system) include directories to which the mapping
9911debfc3dSmrgapplies.  This may be higher up the directory tree than the full path to
9921debfc3dSmrgthe file minus the base name.
9931debfc3dSmrg
9941debfc3dSmrg
9951debfc3dSmrgFile: cppinternals.info,  Node: Concept Index,  Prev: Files,  Up: Top
9961debfc3dSmrg
9971debfc3dSmrgConcept Index
9981debfc3dSmrg*************
9991debfc3dSmrg
[index]
10011debfc3dSmrg* Menu:
10021debfc3dSmrg
10031debfc3dSmrg* assertions:                            Hash Nodes.          (line   6)
10041debfc3dSmrg* controlling macros:                    Guard Macros.        (line   6)
1005a2dc1f3fSmrg* escaped newlines:                      Lexer.               (line   5)
10061debfc3dSmrg* files:                                 Files.               (line   6)
10071debfc3dSmrg* guard macros:                          Guard Macros.        (line   6)
10081debfc3dSmrg* hash table:                            Hash Nodes.          (line   6)
10091debfc3dSmrg* header files:                          Conventions.         (line   6)
10101debfc3dSmrg* identifiers:                           Hash Nodes.          (line   6)
10111debfc3dSmrg* interface:                             Conventions.         (line   6)
10121debfc3dSmrg* lexer:                                 Lexer.               (line   6)
1013a2dc1f3fSmrg* line numbers:                          Line Numbering.      (line   5)
10141debfc3dSmrg* macro expansion:                       Macro Expansion.     (line   6)
10151debfc3dSmrg* macro representation (internal):       Macro Expansion.     (line  19)
10161debfc3dSmrg* macros:                                Hash Nodes.          (line   6)
10171debfc3dSmrg* multiple-include optimization:         Guard Macros.        (line   6)
10181debfc3dSmrg* named operators:                       Hash Nodes.          (line   6)
10191debfc3dSmrg* newlines:                              Lexer.               (line   6)
10201debfc3dSmrg* paste avoidance:                       Token Spacing.       (line   6)
10211debfc3dSmrg* spacing:                               Token Spacing.       (line   6)
1022a2dc1f3fSmrg* token run:                             Lexer.               (line 191)
10231debfc3dSmrg* token spacing:                         Token Spacing.       (line   6)
10241debfc3dSmrg
10251debfc3dSmrg
10261debfc3dSmrg
10271debfc3dSmrgTag Table:
1028a2dc1f3fSmrgNode: Top905
1029a2dc1f3fSmrgNode: Conventions2743
1030a2dc1f3fSmrgNode: Lexer3685
1031a2dc1f3fSmrgRef: Invalid identifiers11599
1032a2dc1f3fSmrgRef: Lexing a line13549
1033a2dc1f3fSmrgNode: Hash Nodes18318
1034a2dc1f3fSmrgNode: Macro Expansion21197
1035a2dc1f3fSmrgNode: Token Spacing30141
1036a2dc1f3fSmrgNode: Line Numbering35997
1037a2dc1f3fSmrgNode: Guard Macros40082
1038a2dc1f3fSmrgNode: Files44873
1039a2dc1f3fSmrgNode: Concept Index48339
10401debfc3dSmrg
10411debfc3dSmrgEnd Tag Table
1042