This is cppinternals.info, produced by makeinfo version 6.8 from
cppinternals.texi.

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* Cpplib: (cppinternals).      Cpplib internals.
END-INFO-DIR-ENTRY

This file documents the internals of the GNU C Preprocessor.

   Copyright (C) 2000-2022 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the entire resulting derived work is distributed under the terms of
a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions.


File: cppinternals.info,  Node: Top,  Next: Conventions,  Up: (dir)

The GNU C Preprocessor Internals
********************************

* Menu:

* Conventions::
* Lexer::
* Hash Nodes::
* Macro Expansion::
* Token Spacing::
* Line Numbering::
* Guard Macros::
* Files::
* Concept Index::

1 Cpplib--the GNU C Preprocessor
********************************

The GNU C preprocessor is implemented as a library, "cpplib", so it can
be easily shared between a stand-alone preprocessor, and a preprocessor
integrated with the C, C++ and Objective-C front ends.  It is also
available for use by other programs, though this is not recommended as
its exposed interface has not yet reached a point of reasonable
stability.

   The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

   This brief manual documents the internals of cpplib, and explains
some of the tricky issues.  It is intended that, along with the comments
in the source code, a reasonably competent C programmer should be able
to figure out what the code is doing, and why things have been
implemented the way they have.

* Menu:

* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.


File: cppinternals.info,  Node: Conventions,  Next: Lexer,  Prev: Top,  Up: Top

Conventions
***********

cpplib has two interfaces--one is exposed internally only, and the other
is for both internal and external use.

   The convention is that functions and types that are exposed to
multiple files internally are prefixed with '_cpp_', and are to be found
in the file 'internal.h'.  Functions and types exposed to external
clients are in 'cpplib.h', and prefixed with 'cpp_'.  For historical
reasons this is no longer quite true, but we should strive to stick to
it.

   We are striving to reduce the information exposed in 'cpplib.h' to
the bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.


File: cppinternals.info,  Node: Lexer,  Next: Hash Nodes,  Prev: Conventions,  Up: Top

The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.cc'.  It is a hand-coded lexer,
and not implemented as a state machine.  It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

   It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
'cpp_spell_token' and 'cpp_token_len'.  These functions are useful when
generating diagnostics, and for emitting the preprocessed output.

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines.  In its current form the code is quite complicated, with
read ahead characters and such-like, since it strives to not step back
in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

   The job of '_cpp_lex_direct' is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor rôle to play in memory management
of lexed lines.  I discuss these issues in a separate section (*note
Lexing a line::).

   The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

   The lexer does not consider whitespace to be a token in its own
right.  If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags.  Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token.  This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.

   The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line.  This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it.  Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set.  The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.
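
   As a rough illustration of the bookkeeping just described, the
following sketch records a line, a column and the two flags for each
token of a toy language whose tokens are separated by spaces.  It is
not cpplib code: the 'sk_' names, the flag values and the 1-based
columns are assumptions made for the example, and tabs, comments and
escaped newlines are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for cpplib's token flags.  */
#define SK_PREV_WHITE (1u << 0) /* Whitespace precedes this token.  */
#define SK_BOL        (1u << 1) /* First token on a logical line.  */

struct sk_token
{
  unsigned int line;   /* 1-based line of the token's first character.  */
  unsigned int col;    /* 1-based column of the token's first character.  */
  unsigned int flags;
  const char *start;   /* Points into the input; not NUL-terminated.  */
  size_t len;
};

/* Split TEXT into space-separated "tokens", storing at most MAX of
   them in OUT.  Returns the number stored.  */
size_t
sk_lex (const char *text, struct sk_token *out, size_t max)
{
  unsigned int line = 1, col = 1;
  int at_bol = 1, saw_white = 0;
  size_t n = 0;
  const char *p = text;

  assert (text != NULL && out != NULL);
  while (*p != '\0' && n < max)
    {
      if (*p == '\n')
        {
          line++;
          col = 1;
          at_bol = 1;
          saw_white = 0;   /* A new line does not count as whitespace.  */
          p++;
        }
      else if (*p == ' ')
        {
          saw_white = 1;
          col++;
          p++;
        }
      else
        {
          struct sk_token *tok = &out[n++];

          tok->line = line;
          tok->col = col;
          tok->flags = ((at_bol ? SK_BOL : 0)
                        | (saw_white ? SK_PREV_WHITE : 0));
          tok->start = p;
          while (*p != '\0' && *p != ' ' && *p != '\n')
            {
              p++;
              col++;
            }
          tok->len = (size_t) (p - tok->start);
          at_bol = 0;
          saw_white = 0;
        }
    }
  return n;
}
```

   Note that an indented first token carries both flags, which is why a
client cannot infer the start of a line from whitespace alone.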

   New lines are treated specially; exactly how the lexer handles them
is context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive.  In
a directive a 'CPP_EOF' token never means end-of-file.  Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.

   The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringized.  The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2.  It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

   This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.

   The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator.  This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.
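
   One way to recognize the four spellings is sketched below.  The
function names are invented for this example and do not exist in
cpplib; the only rule taken from the text above is that a newline
character followed by the _other_ newline character forms a single
indicator.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of characters forming the new line indicator at P,
   or 0 if P does not point at one.  "\r\n" and "\n\r" count as a single
   indicator; "\n\n" and "\r\r" are two separate new lines.  */
size_t
sk_newline_len (const char *p)
{
  assert (p != NULL);
  if (p[0] != '\n' && p[0] != '\r')
    return 0;
  /* A different newline character directly after the first one belongs
     to the same indicator.  */
  if ((p[1] == '\n' || p[1] == '\r') && p[1] != p[0])
    return 2;
  return 1;
}

/* Count the new line indicators in TEXT.  */
size_t
sk_count_newlines (const char *text)
{
  size_t count = 0;

  while (*text != '\0')
    {
      size_t len = sk_newline_len (text);

      if (len > 0)
        {
          count++;
          text += len;
        }
      else
        text++;
    }
  return count;
}
```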

   We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline.  It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars.  Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.

   Handling newlines like this is made simpler by doing it in one place
only.  The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminates a comment.  Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.

   So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline.  If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.

   Similarly code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.
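
   The idea can be sketched as follows.  These functions only mirror the
roles of 'skip_escaped_newlines' and 'get_effective_char' as described
above; their names, signatures and string-based interface are
assumptions for the example, not cpplib's.  The sketch also leaves out
the extension that tolerates whitespace between the backslash and the
newline.

```c
#include <assert.h>
#include <stddef.h>

/* If an escaped newline starts at P, return its length, else 0.  The
   backslash may be spelt directly or as its trigraph (two '?'
   characters followed by '/').  */
static size_t
sk_escaped_newline_len (const char *p)
{
  size_t n;

  if (p[0] == '\\')
    n = 1;
  else if (p[0] == '?' && p[1] == '?' && p[2] == '/')
    n = 3;
  else
    return 0;

  if (p[n] == '\n' || p[n] == '\r')
    {
      char first = p[n++];

      /* Accept "\r\n" and "\n\r" as one new line.  */
      if ((p[n] == '\n' || p[n] == '\r') && p[n] != first)
        n++;
      return n;
    }
  return 0;
}

/* Skip an arbitrarily long sequence of escaped newlines at P.  */
const char *
sk_skip_escaped_newlines (const char *p)
{
  size_t len;

  assert (p != NULL);
  while ((len = sk_escaped_newline_len (p)) > 0)
    p += len;
  return p;
}

/* Return the first character at or after P that is not part of an
   escaped newline.  */
char
sk_effective_char (const char *p)
{
  return *sk_skip_escaped_newlines (p);
}

/* Nonzero if the text at P spells the '+=' token, even when escaped
   newlines separate its two characters.  */
int
sk_is_plus_eq (const char *p)
{
  return p[0] == '+' && sk_effective_char (p + 1) == '=';
}
```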

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
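
   A minimal sketch of such column counting follows.  It assumes
1-based columns with tab stops at columns 1, 1 + N, 1 + 2N and so on
for a tab width of N; cpplib's exact convention may differ, and the
function names are hypothetical.

```c
#include <assert.h>

/* Return the column reached after consuming character C at 1-based
   column COL.  A tab moves to the next tab stop; TABSTOP plays the role
   of the value given to -ftabstop=.  */
unsigned int
sk_advance_col (unsigned int col, char c, unsigned int tabstop)
{
  assert (col >= 1 && tabstop >= 1);
  if (c == '\t')
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
}

/* Column of the first character after TEXT, which starts at column 1
   and must not contain a new line.  Comment text is counted like any
   other text, so positions after a comment stay correct.  */
unsigned int
sk_col_after (const char *text, unsigned int tabstop)
{
  unsigned int col = 1;

  for (; *text != '\0'; text++)
    col = sk_advance_col (col, *text, tabstop);
  return col;
}
```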

   Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic.  However, if they appear in a macro
expansion we don't want to complain with each use of the macro.  It is
therefore best to catch them during the lexing stage, in
'parse_identifier'.  In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state.  For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro.  Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate.  Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.

   Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, it should be lexed as a single
token as far as the nearest '>' character.  Note that we don't allow the
terminators of header names to be escaped; the first '"' or '>'
terminates the header name.

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error.  Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.

   Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'.  The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation
unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

   Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

   The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do _not_ lex a line of tokens at
once; if we did that 'parse_identifier' would not have state flags
available to warn about invalid identifiers (*note Invalid
identifiers::).

   In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

   Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

   Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run.  Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.
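
   The chaining scheme can be modelled in a few lines.  The structures
below are a simplified sketch with hypothetical 'sk_' names and a
made-up one-field token, not cpplib's own definitions; the point is
only that growing by chaining, rather than by 'realloc', never moves a
token whose address has already been handed out.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_token { int type; };

/* One fixed-size block of tokens; blocks are chained, never resized.  */
struct sk_tokenrun
{
  struct sk_token *base, *limit;
  struct sk_tokenrun *next;
};

struct sk_lexer
{
  size_t run_size;                /* Tokens per run.  */
  struct sk_tokenrun *first_run;  /* Start of the chain.  */
  struct sk_tokenrun *cur_run;    /* Run currently being filled.  */
  struct sk_token *cur_token;     /* Next free slot in CUR_RUN.  */
};

static struct sk_tokenrun *
sk_new_run (size_t count)
{
  struct sk_tokenrun *run = malloc (sizeof *run);

  if (run == NULL)
    abort ();
  run->base = malloc (count * sizeof *run->base);
  if (run->base == NULL)
    abort ();
  run->limit = run->base + count;
  run->next = NULL;
  return run;
}

void
sk_lexer_init (struct sk_lexer *lx, size_t run_size)
{
  assert (run_size > 0);
  lx->run_size = run_size;
  lx->first_run = lx->cur_run = sk_new_run (run_size);
  lx->cur_token = lx->cur_run->base;
}

/* Return storage for the next token on the current line.  */
struct sk_token *
sk_alloc_token (struct sk_lexer *lx)
{
  if (lx->cur_token == lx->cur_run->limit)
    {
      /* Overflow: move to the next run, creating it unless an earlier
         long line already did.  Existing tokens stay where they are.  */
      if (lx->cur_run->next == NULL)
        lx->cur_run->next = sk_new_run (lx->run_size);
      lx->cur_run = lx->cur_run->next;
      lx->cur_token = lx->cur_run->base;
    }
  return lx->cur_token++;
}

/* At an unescaped new line, start again at the head of the chain.  */
void
sk_start_line (struct sk_lexer *lx)
{
  lx->cur_run = lx->first_run;
  lx->cur_token = lx->cur_run->base;
}
```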

   The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'.  So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like '__LINE__', and the '#' and '##' operators for stringizing
and token pasting.  I handled this by allocating space for these tokens
from the lexer's token run chain.  This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

   Lexing into a line of tokens solves some of the token memory
management issues, but not all.  The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line.  So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro.  In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

   The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made.  '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.


File: cppinternals.info,  Node: Hash Nodes,  Next: Macro Expansion,  Prev: Lexer,  Up: Top

Hash Nodes
**********

When cpplib encounters an "identifier", it generates a hash code for it
and stores it in the hash table.  By "identifier" we mean tokens with
type 'CPP_NAME'; this includes identifiers in the usual C sense, as well
as keywords, directive names, macro names and so on.  For example, all
of 'pragma', 'int', 'foo' and '__GNUC__' are identifiers and hashed when
lexed.

   Each node in the hash table contains various information about the
identifier it represents, for example its length and type.  At any one
time, each identifier falls into exactly one of three categories:

   * Macros

     These have been declared to be macros, either on the command line
     or with '#define'.  A few, such as '__TIME__' are built-ins entered
     in the hash table during initialization.  The hash node for a
     normal macro points to a structure with more information about the
     macro, such as whether it is function-like, how many arguments it
     takes, and its expansion.  Built-in macros are flagged as special,
     and instead contain an enum indicating which of the various
     built-in macros it is.

   * Assertions

     Assertions are in a separate namespace to macros.  To enforce this,
     cpp actually prepends a '#' character to the assertion's name
     before hashing it and entering it in the hash table.  An
     assertion's node points to a chain of answers to that assertion.

   * Void

     Everything else falls into this category--an identifier that is not
     currently a macro, or a macro that has since been undefined with
     '#undef'.

     When preprocessing C++, this category also includes the named
     operators, such as 'xor'.  In expressions these behave like the
     operators they represent, but in contexts where the spelling of a
     token matters they are spelt differently.  This spelling
     distinction is relevant when they are operands of the stringizing
     and pasting macro operators '#' and '##'.  Named operator hash
     nodes are flagged, both to catch the spelling distinction and to
     prevent them from being defined as macros.

   The same identifiers share the same hash node.  Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information.  For example, when
parsing a '#define' statement, CPP flags each argument's identifier hash
node with the index of that argument.  This makes duplicated argument
checking an O(1) operation for each argument.  Similarly, for each
identifier in the macro's expansion, lookup to see if it is an argument,
and which argument it is, is also an O(1) operation.  Further, each
directive name, such as 'endif', has an associated directive enum stored
in its hash node, so that directive lookup is also O(1).
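
   The sharing of nodes, and the constant-time duplicate check it
enables, can be sketched with a small chained hash table.  Everything
here is hypothetical: the 'sk_' names, the fixed table size, the hash
function and the 1-based 'arg_index' field are choices made for the
example, and the sketch omits clearing the indices again once the
'#define' has been parsed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_TABLE_SIZE 64u   /* Fixed bucket count, for brevity.  */
#define SK_NOT_AN_ARG 0u

struct sk_hashnode
{
  struct sk_hashnode *chain;  /* Next node in the same bucket.  */
  char *name;
  size_t len;
  unsigned int arg_index;     /* 1-based parameter index, or 0.  */
};

struct sk_table
{
  struct sk_hashnode *buckets[SK_TABLE_SIZE];
};

static unsigned int
sk_hash (const char *s, size_t len)
{
  unsigned int h = 5381u;

  while (len-- > 0)
    h = h * 33u + (unsigned char) *s++;
  return h % SK_TABLE_SIZE;
}

/* Return the unique node for the LEN-character identifier at S,
   creating it on first use.  */
struct sk_hashnode *
sk_intern (struct sk_table *t, const char *s, size_t len)
{
  struct sk_hashnode **bucket = &t->buckets[sk_hash (s, len)];
  struct sk_hashnode *n;

  for (n = *bucket; n != NULL; n = n->chain)
    if (n->len == len && memcmp (n->name, s, len) == 0)
      return n;

  n = malloc (sizeof *n);
  if (n == NULL)
    abort ();
  n->name = malloc (len + 1);
  if (n->name == NULL)
    abort ();
  memcpy (n->name, s, len);
  n->name[len] = '\0';
  n->len = len;
  n->arg_index = SK_NOT_AN_ARG;
  n->chain = *bucket;
  *bucket = n;
  return n;
}

/* Record NODE as parameter number INDEX (1-based) of the macro being
   defined.  Returns 0 if NODE is already a parameter, i.e. the
   parameter name is duplicated, and 1 otherwise.  */
int
sk_add_param (struct sk_hashnode *node, unsigned int index)
{
  assert (index != SK_NOT_AN_ARG);
  if (node->arg_index != SK_NOT_AN_ARG)
    return 0;
  node->arg_index = index;
  return 1;
}
```

   Because every occurrence of a spelling maps to the same node, testing
whether an identifier in the replacement list is a parameter is a
single field read once the token's node pointer is known.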


File: cppinternals.info,  Node: Macro Expansion,  Next: Token Spacing,  Prev: Hash Nodes,  Up: Top

Macro Expansion Algorithm
*************************

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to optimize
the preprocessor's expansion algorithm wrong in quite subtle ways.

   I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this section,
let alone the code!  If you don't have a clear mental picture of how
things like nested macro expansion, stringizing and token pasting are
supposed to work, damage to your sanity can quickly result.

Internal representation of macros
=================================

The preprocessor stores macro expansions in tokenized form.  This saves
repeated lexing passes during expansion, at the cost of a small increase
in memory consumption on average.  The tokens are stored contiguously in
memory, so a pointer to the first one and a token count is all you need
to get the replacement list of a macro.

   If the macro is a function-like macro the preprocessor also stores
its parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier.  Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type 'CPP_MACRO_ARG'.  Each such token holds the index
of the parameter it represents in the parameter list, which allows rapid
replacement of parameters with their arguments during expansion.
Despite this optimization it is still necessary to store the original
parameters to the macro, both for dumping with e.g., '-dD', and to warn
about non-trivial macro redefinitions when the parameter names have
changed.
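
   The representation can be pictured with the following sketch.  It is
a deliberately reduced model: the type and field names are invented,
parameter names are kept as plain strings rather than hash node
pointers, the index is assumed to be 0-based, and each argument is
limited to a single token so that substitution is a one-for-one copy.

```c
#include <assert.h>
#include <stddef.h>

enum sk_ttype { SK_NAME, SK_OTHER, SK_MACRO_ARG };

struct sk_token
{
  enum sk_ttype type;
  const char *spelling;    /* Unused for SK_MACRO_ARG.  */
  unsigned int arg_no;     /* 0-based parameter index for SK_MACRO_ARG.  */
};

struct sk_macro
{
  const char *const *params;   /* Original parameter names, in order.  */
  unsigned int paramc;
  const struct sk_token *exp;  /* Contiguous replacement list.  */
  unsigned int count;          /* Number of tokens in EXP.  */
};

/* Copy M's replacement list to OUT, substituting ARGS[i] for each
   occurrence of parameter i.  OUT must have room for M->count tokens.
   Returns the number of tokens written.  */
unsigned int
sk_substitute (const struct sk_macro *m, const struct sk_token *args,
               struct sk_token *out)
{
  unsigned int i;

  for (i = 0; i < m->count; i++)
    {
      const struct sk_token *t = &m->exp[i];

      if (t->type == SK_MACRO_ARG)
        {
          assert (t->arg_no < m->paramc);
          out[i] = args[t->arg_no];   /* Index lookup, no name compare.  */
        }
      else
        out[i] = *t;
    }
  return m->count;
}
```

   For '#define ADD(a, b) a + b' the stored list would be the three
tokens (argument 0), '+', (argument 1), while 'params' still records
the names 'a' and 'b' for dumping and redefinition checks.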
4716fac5056Sskrll
4726fac5056SskrllMacro expansion overview
4736fac5056Sskrll========================
4746fac5056Sskrll
4756fac5056SskrllThe preprocessor maintains a "context stack", implemented as a linked
47696d60fd4Smrglist of 'cpp_context' structures, which together represent the macro
47796d60fd4Smrgexpansion state at any one time.  The 'struct cpp_reader' member
47896d60fd4Smrgvariable 'context' points to the current top of this stack.  The top
4796fac5056Sskrllnormally holds the unexpanded replacement list of the innermost macro
4806fac5056Sskrllunder expansion, except when cpplib is about to pre-expand an argument,
4816fac5056Sskrllin which case it holds that argument's unexpanded tokens.
4826fac5056Sskrll
4836fac5056Sskrll   When there are no macros under expansion, cpplib is in "base
48496d60fd4Smrgcontext".  All contexts other than the base context contain a contiguous
48596d60fd4Smrglist of tokens delimited by a starting and ending token.  When not in
48696d60fd4Smrgbase context, cpplib obtains the next token from the list of the top
48796d60fd4Smrgcontext.  If there are no tokens left in the list, it pops that context
48896d60fd4Smrgoff the stack, and subsequent ones if necessary, until an unexhausted
48996d60fd4Smrgcontext is found or it returns to base context.  In base context, cpplib
49096d60fd4Smrgreads tokens directly from the lexer.

   If it encounters an identifier that is both a macro and enabled for
expansion, cpplib prepares to push a new context for that macro on the
stack by calling the routine 'enter_macro_context'.  When this routine
returns, the new context will contain the unexpanded tokens of the
replacement list of that macro.  In the case of function-like macros,
'enter_macro_context' also replaces any parameters in the replacement
list, stored as 'CPP_MACRO_ARG' tokens, with the appropriate macro
argument.  If the standard requires that the parameter be replaced with
its expanded argument, the argument will have been fully macro expanded
first.

   'enter_macro_context' also handles special macros like '__LINE__'.
Although these macros expand to a single token which cannot contain any
further macros, for reasons of token spacing (*note Token Spacing::) and
simplicity of implementation, cpplib handles these special macros by
pushing a context containing just that one token.

   The final thing that 'enter_macro_context' does before returning is
to mark the macro disabled for expansion (except for special macros like
'__TIME__').  The macro is re-enabled when its context is later popped
from the context stack, as described above.  This strict ordering
ensures that a macro is disabled whilst its expansion is being scanned,
but that it is _not_ disabled whilst any arguments to it are being
expanded.
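
   The interplay between pushing a context, disabling the macro and
re-enabling it on pop can be sketched for object-like macros only.  The
following Python is a simplified model, not cpplib's algorithm: the
function 'expand', the 'disabled' set and the list-based contexts are
assumptions, and function-like macros, arguments and the 'NO_EXPAND'
flag are left out entirely.

```python
# Toy model (not cpplib code): object-like macros only.

def expand(tokens, macros):
    """Expand 'tokens'; 'macros' maps a name to its replacement list."""
    lexer = iter(tokens)
    contexts = []      # each entry is [macro_name, remaining_tokens]
    disabled = set()
    out = []
    while True:
        # Contexts are popped only when the next token is wanted;
        # popping a context re-enables its macro.
        while contexts and not contexts[-1][1]:
            disabled.discard(contexts.pop()[0])
        if contexts:
            tok = contexts[-1][1].pop(0)
        else:
            tok = next(lexer, None)
            if tok is None:
                return out
        if tok in macros and tok not in disabled:
            contexts.append([tok, list(macros[tok])])
            disabled.add(tok)  # disabled whilst its expansion is scanned
        else:
            out.append(tok)


# Hypothetical definitions: #define foo 1 + foo
self_ref = expand(["foo", ";"], {"foo": ["1", "+", "foo"]})
# Hypothetical definitions: #define a b  and  #define b a
mutual = expand(["a"], {"a": ["b"], "b": ["a"]})
```

   In the first call the trailing 'foo' is the last token of its own
context.  Because that context is still on the stack when the token is
examined, 'foo' is still disabled and is not expanded again.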

Scanning the replacement list for macros to expand
==================================================

The C standard states that, after any parameters have been replaced with
their possibly-expanded arguments, the replacement list is scanned for
nested macros.  Further, any identifiers in the replacement list that
are not expanded during this scan are never again eligible for expansion
in the future, if the reason they were not expanded is that the macro in
question was disabled.

   Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion.  Other tokens never have an opportunity to be
re-tested for expansion.  It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan.  This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token happens
to be an opening parenthesis (itself possibly the first token of an
argument).

   It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack.  Only when
looking for the _next_ token do we pop it off the stack and drop to a
lower context.  This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current context
is still disabled when we are considering the last token of its
replacement list for expansion (or indeed expanding it).  As an example,
which illustrates many of the points above, consider

     #define foo(x) bar x
     foo(foo) (2)

which fully expands to 'bar foo (2)'.  During pre-expansion of the
argument, 'foo' does not expand even though the macro is enabled, since
it has no following parenthesis [pre-expansion of an argument only uses
tokens from that argument; it cannot take tokens from whatever follows
the macro invocation].  This still leaves the argument token 'foo'
eligible for future expansion.  Then, when re-scanning after argument
replacement, the token 'foo' is rejected for expansion, and marked
ineligible for future expansion, since the macro is now disabled.  It is
disabled because the replacement list 'bar foo' of the macro is still on
the context stack.

   If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro was disabled, it would be subtly wrong.
In the example above, the replacement list of 'foo' would be popped in
the process of finding the parenthesis, re-enabling 'foo' and expanding
it a second time.

Looking for a function-like macro's opening parenthesis
=======================================================

Function-like macros expand only when immediately followed by an
opening parenthesis.  To do this cpplib needs to temporarily disable
macros and read the next token.  Unfortunately, because of spacing
issues (*note Token Spacing::), there can be fake padding tokens
in-between, and if the next real token is not a parenthesis cpplib needs
to be able to back up that one token as well as retain the information
in any intervening padding tokens.

   Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which are too hard.
Instead, searching for the parenthesis is handled by a special function,
'funlike_invocation_p', which remembers padding information as it reads
tokens.  If the next real token is not an opening parenthesis, it backs
up that one token, and then pushes an extra context just containing the
padding information if necessary.
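
   A rough sketch of that search follows.  It borrows the name
'funlike_invocation_p' from the text but is only a Python model of the
behaviour described: the list-as-token-queue representation and the
'PADDING' marker are assumptions, and re-inserting the padding at the
front of the list merely stands in for pushing a padding-only context.

```python
# Toy model (not cpplib code) of looking for the opening parenthesis.

PADDING = "<padding>"


def funlike_invocation_p(stream):
    """'stream' is a list of upcoming tokens; index 0 is the next one.

    Returns True, consuming the '(', if the next real token is an
    opening parenthesis.  Otherwise backs up the one real token that
    was read and keeps the padding information in front of it.
    """
    padding = None
    token = stream.pop(0) if stream else None
    while token == PADDING:
        if padding is None:
            padding = token      # remember that padding was seen
        token = stream.pop(0) if stream else None
    if token == "(":
        return True
    if token is not None:
        stream.insert(0, token)  # back up that one token
    if padding is not None:
        stream.insert(0, padding)  # stands in for a padding-only context
    return False


call = [PADDING, PADDING, "(", "1", ")"]
not_a_call = [PADDING, "+", "2"]
is_call = funlike_invocation_p(call)
is_not_call = funlike_invocation_p(not_a_call)
```

   After the second call the '+' token is available again and is still
preceded by padding, so no spacing information is lost.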

Marking tokens ineligible for future expansion
==============================================

As discussed above, cpplib needs a way of marking tokens as
unexpandable.  Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the flag
'NO_EXPAND' to the copy.
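
   The copy-then-flag idea can be sketched in a few lines.  This Python
fragment is illustrative only: the 'Token' class, the
'mark_unexpandable' helper and the bit value chosen for 'NO_EXPAND' are
assumptions, not cpplib definitions.

```python
# Toy model (not cpplib code): flag a copy, leave the lexed token alone.
from dataclasses import dataclass, replace

NO_EXPAND = 1 << 0  # illustrative bit value, not cpplib's


@dataclass(frozen=True)
class Token:
    spelling: str
    flags: int = 0


def mark_unexpandable(token):
    """Return a flagged copy; the original token is not modified."""
    return replace(token, flags=token.flags | NO_EXPAND)


lexed = Token("foo")
copy = mark_unexpandable(lexed)
```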

   For efficiency and to simplify memory management by avoiding having
to remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (*note Lexing a line::) using the
function '_cpp_temp_token'.  The tokens are then re-used once the
current line of tokens has been read in.

   This might sound unsafe.  However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization is
safe.


File: cppinternals.info,  Node: Token Spacing,  Next: Line Numbering,  Prev: Macro Expansion,  Up: Top

Token Spacing
*************

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its
preprocessed output results in an identical token stream.  Without
taking special measures, this might not be the case because of macro
substitution.  For example:

     #define PLUS +
     #define EMPTY
     #define f(x) =x=
     +PLUS -EMPTY- PLUS+ f(=)
             ==> + + - - + + = = =
     _not_
             ==> ++ -- ++ ===

   One solution would be to simply insert a space between all adjacent
tokens.  However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

   For now, just notice that when tokens are added to (or removed from,
as shown by the 'EMPTY' example) the original lexed token stream, we
need to check for accidental token pasting.  We call this "paste
avoidance".  Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the '#' and '##' operators.

   Look at how the preprocessor gets whitespace output correct normally.
The 'cpp_token' structure contains a flags byte, and one of those flags
is 'PREV_WHITE'.  This is flagged by the lexer, and indicates that the
token was preceded by whitespace of some form other than a new line.
The stand-alone preprocessor can use this flag to decide whether to
insert a space between tokens in the output.
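
   As a minimal sketch of the normal case, the Python below records a
'PREV_WHITE'-like boolean per token while lexing and uses it when
writing the tokens out again.  It is a toy, not cpplib's lexer: the
functions 'lex' and 'output' are hypothetical, and every punctuation
character is treated as its own token.

```python
# Toy model (not cpplib code): a per-token "preceded by whitespace" flag.
import re


def lex(line):
    """Split a line into (spelling, prev_white) pairs."""
    tokens = []
    for match in re.finditer(r"(\s*)(\w+|[^\w\s])", line):
        tokens.append((match.group(2), match.group(1) != ""))
    return tokens


def output(tokens):
    """Write tokens back, inserting one space where the flag is set."""
    return "".join(
        (" " if prev_white else "") + spelling
        for spelling, prev_white in tokens
    )


text = "sum = add (1,2, 3);"
tokens = lex(text)
round_trip = output(tokens)
```

   Any run of whitespace collapses to a single space on output, since
only a flag is stored and not the whitespace itself.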

   Now consider the result of the following macro expansion:

     #define add(x, y, z) x + y +z;
     sum = add (1,2, 3);
             ==> sum = 1 + 2 +3;

   The interesting thing here is that the tokens '1' and '2' are output
with a preceding space, and '3' is output without a preceding space, but
when lexed none of these tokens had that property.  Careful
consideration reveals that '1' gets its preceding whitespace from the
space preceding 'add' in the macro invocation, _not_ from the
replacement list.  '2' gets its whitespace from the space preceding the
parameter 'y' in the macro replacement list, and '3' has no preceding
space because parameter 'z' has none in the replacement list.

   Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions.  So instead of modifying the two tokens
above, the preprocessor inserts a special token, which I call a "padding
token", into the token stream to indicate that spacing of the subsequent
token is special.  The preprocessor inserts padding tokens in front of
every macro expansion and expanded macro argument.  These point to a
"source token" from which the subsequent real token should inherit its
spacing.  In the above example, the source tokens are 'add' in the macro
invocation, and 'y' and 'z' in the macro replacement list, respectively.

   It is quite easy to get multiple padding tokens in a row, for example
if a macro's first replacement token expands straight into another
macro.

     #define foo bar
     #define bar baz
     [foo]
             ==> [baz]

   Here, two padding tokens are generated, whose sources are the 'foo'
token between the brackets and the 'bar' token from foo's replacement
list, respectively.  Clearly the first padding token is the one to use,
so the output code should contain a rule that the first padding token in
a sequence is the one that matters.

   But what happens on leaving a macro expansion?  Adjusting the above
example slightly:

     #define foo bar
     #define bar EMPTY baz
     #define EMPTY
     [foo] EMPTY;
             ==> [ baz] ;

   As shown, now there should be a space before 'baz' and the semicolon
in the output.

   The rules we decided on above fail for 'baz': we generate three
padding tokens, one per macro invocation, before the token 'baz'.  We
would then have it take its spacing from the first of these, which
carries source token 'foo' with no leading space.

   It is vital that cpplib get spacing correct in these examples since
any of these macro expansions could be stringized, where spacing
matters.

   So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too.  I made
cpplib insert a padding token with a 'NULL' source token when leaving
macro expansions, as well as after each replaced argument in a macro's
replacement list.  It also inserts appropriate padding tokens on either
side of tokens created by the '#' and '##' operators.  I expanded the
rule so that, if we see a padding token with a 'NULL' source token,
_and_ the source token chosen so far has no leading space, then we
behave as if we have seen no padding tokens at all.  A quick check shows
this rule will then get the above example correct as well.
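
   The "quick check" can be carried out mechanically.  The Python below
is one reading of the padding rule, written as a toy model rather than
taken from cpplib: the classes 'Tok' and 'Pad', the function 'render'
and the hand-built token streams (one padding token in front of each
expansion, one with a 'None' source on leaving it) are all assumptions
made for the example.

```python
# Toy model (not cpplib code) of choosing spacing from padding tokens.

class Tok:
    def __init__(self, spelling, prev_white=False):
        self.spelling, self.prev_white = spelling, prev_white


class Pad:
    def __init__(self, source=None):
        self.source = source  # a Tok, or None when leaving an expansion


def render(stream):
    out = []
    source = None  # source token chosen from the padding seen so far
    for item in stream:
        if isinstance(item, Pad):
            # The first padding token wins, unless its source has no
            # leading space and a later padding token has no source;
            # then we act as if no padding had been seen.
            if source is None or (not source.prev_white
                                  and item.source is None):
                source = item.source
            continue
        spacing = source if source is not None else item
        out.append((" " if spacing.prev_white else "") + item.spelling)
        source = None
    return "".join(out)


# [foo]  with  #define foo bar  and  #define bar baz
simple = render([
    Tok("["), Pad(Tok("foo")), Pad(Tok("bar")), Tok("baz"),
    Pad(None), Pad(None), Tok("]"),
])

# [foo] EMPTY;  with  #define bar EMPTY baz  and an empty EMPTY
adjusted = render([
    Tok("["), Pad(Tok("foo")), Pad(Tok("bar")), Pad(Tok("EMPTY")),
    Pad(None), Tok("baz", prev_white=True), Pad(None), Pad(None),
    Tok("]"), Pad(Tok("EMPTY", prev_white=True)), Pad(None), Tok(";"),
])
```

   In the second stream the 'None'-source padding after the empty
expansion discards the 'foo' source, so 'baz' falls back on its own
leading space, while the semicolon inherits the space in front of the
second 'EMPTY'.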

   Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct.  This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function 'cpp_avoid_paste' advises whether a space is required
between two consecutive tokens.  To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.
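
   To make the goal of paste avoidance concrete, here is a brute-force
check in Python.  It is deliberately _not_ how 'cpp_avoid_paste' works:
instead of a fast, conservative test, it re-lexes the two spellings
joined together and asks whether the same two tokens come back.  The
tiny token grammar and the names 'lex', 'needs_space' and 'emit' are
assumptions for illustration.

```python
# Toy model (not cpplib code): detect accidental pastes by re-lexing.
import re

# A deliberately small grammar: identifiers, numbers, a few punctuators.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|==|\+=|-=|->|[-+=<>;()\[\]]")


def lex(text):
    return TOKEN.findall(text)


def needs_space(left, right):
    """True if writing left and right unseparated would lex differently."""
    return lex(left + right) != [left, right]


def emit(tokens):
    """Join tokens, inserting a space only where a paste would occur."""
    out = ""
    prev = None
    for tok in tokens:
        if prev is not None and needs_space(prev, tok):
            out += " "
        out += tok
        prev = tok
    return out


# The token stream from the PLUS/EMPTY/f example in this node.
stream = ["+", "+", "-", "-", "+", "+", "=", "=", "="]
printed = emit(stream)
```

   The property that matters is that lexing 'printed' yields 'stream'
again, which is exactly the guarantee wanted from the stand-alone
preprocessor's output.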


File: cppinternals.info,  Node: Line Numbering,  Next: Guard Macros,  Prev: Token Spacing,  Up: Top

Line numbering
**************

Just which line number anyway?
==============================

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

   * The source line it was lexed on.
   * The line it is output on.  This can be different to the line it was
     lexed on if, for example, there are intervening escaped newlines or
     C-style comments.  For example:

          foo /* A long
          comment */ bar \
          baz
          =>
          foo bar baz

   * If the token results from a macro expansion, the line of the macro
     name, or possibly the line of the closing parenthesis in the case
     of function-like macro expansion.

   The 'cpp_token' structure contains 'line' and 'col' members.  The
lexer fills these in with the line and column of the first character of
the token.  Consequently, but maybe unexpectedly, a token from the
replacement list of a macro expansion carries the location of the token
within the '#define' directive, because cpplib expands a macro by
returning pointers to the tokens in its replacement list.  The current
implementation of cpplib assigns tokens created from built-in macros and
the '#' and '##' operators the location of the most recently lexed
token.  This is because they are allocated from the lexer's token
runs, and because of the way the diagnostic routines infer the
appropriate location to report.

   The diagnostic routines in cpplib display the location of the most
recently _lexed_ token, unless they are passed a specific line and
column to report.  For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.

   The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion.  All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

   To solve these issues, cpplib provides a callback that is generated
whenever it lexes a preprocessing token that starts a new logical line
other than a directive.  It passes this token (which may be a 'CPP_EOF'
token indicating the end of the translation unit) to the callback
routine, which can then use the line and column of this token to produce
correct output.

Representation of line numbers
==============================

As mentioned above, cpplib stores with each token the line number that
it was lexed on.  In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

   The preprocessor maintains a monotonically increasing line count,
which is incremented at every new line character (and also at the end of
any buffer that does not end in a new line).  Since a line number of
zero is useful to indicate certain special states and conditions, this
variable starts counting from one.

   This variable therefore uniquely enumerates each line in the
translation unit.  With some simple infrastructure, it is
straightforward to map from this to the original source file and line
number pair, saving space whenever line number information needs to be
saved.  The code that implements this mapping lies in the files
'line-map.cc' and 'line-map.h'.
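
   The kind of "simple infrastructure" meant here can be sketched with
a sorted table and a binary search.  The Python below is a toy model
and makes no claim about the data structures in 'line-map.cc': the
class 'LineMaps', its methods and the example file names are
assumptions for illustration.

```python
# Toy model (not line-map.cc): translation-unit line -> (file, line).
import bisect


class LineMaps:
    def __init__(self):
        self.starts = []  # first translation-unit line of each map
        self.maps = []    # (file name, source line at that start)

    def add(self, start, filename, source_line):
        """Record that TU line 'start' is 'source_line' of 'filename'.

        Maps must be added in increasing order of 'start'.
        """
        self.starts.append(start)
        self.maps.append((filename, source_line))

    def lookup(self, line):
        i = bisect.bisect_right(self.starts, line) - 1
        if i < 0:
            raise ValueError("line precedes the first map")
        filename, source_line = self.maps[i]
        return filename, source_line + (line - self.starts[i])


# main.c has '#include "a.h"' on its line 2; a.h is three lines long.
maps = LineMaps()
maps.add(1, "main.c", 1)  # TU lines 1-2
maps.add(3, "a.h", 1)     # TU lines 3-5
maps.add(6, "main.c", 3)  # TU line 6 onwards
inside_header = maps.lookup(4)
after_header = maps.lookup(7)
```

   Each token then needs to store only one integer; the file name and
the source line are recovered on demand.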

   Command-line macros and assertions are implemented by pushing a
buffer containing the right hand side of an equivalent '#define' or
'#assert' directive.  Some built-in macros are handled similarly.  Since
these are all processed before the first line of the main input file, it
will typically have an assigned line closer to twenty than to one.


File: cppinternals.info,  Node: Guard Macros,  Next: Files,  Prev: Line Numbering,  Up: Top

The Multiple-Include Optimization
*********************************

Header files are often of the form

     #ifndef FOO
     #define FOO
     ...
     #endif

to prevent the compiler from processing them more than once.  The
preprocessor notices such header files, so that if the header file
appears in a subsequent '#include' directive and 'FOO' is defined, then
the preprocessor ignores it and does not preprocess or even re-open the
file a second time.  This is referred to as the "multiple include
optimization".

   Under what circumstances is such an optimization valid?  If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process.  Therefore the current implementation imposes
requirements and makes some allowances as follows:

  1. There must be no tokens outside the controlling '#if'-'#endif'
     pair, but whitespace and comments are permitted.

  2. There must be no directives outside the controlling directive pair,
     but the "null directive" (a line containing nothing other than a
     single '#' and possibly whitespace) is permitted.

  3. The opening directive must be of the form

          #ifndef FOO

     or

          #if !defined FOO     [equivalently, #if !defined(FOO)]

  4. In the second form above, the tokens forming the '#if' expression
     must have come directly from the source file--no macro expansion
     must have been involved.  This is because macro definitions can
     change, and tracking whether or not a relevant change has been made
     is not worth the implementation cost.

  5. There can be no '#else' or '#elif' directives at the outer
     conditional block level, because they would probably contain
     something of interest to a subsequent pass.

   First, when pushing a new file on the buffer stack,
'_stack_include_file' sets the controlling macro 'mi_cmacro' to 'NULL',
and sets 'mi_valid' to 'true'.  This indicates that the preprocessor has
not yet encountered anything that would invalidate the multiple-include
optimization.  As described in the next few paragraphs, these two
variables having these values effectively indicates top-of-file.

   When about to return a token that is not part of a directive,
'_cpp_lex_token' sets 'mi_valid' to 'false'.  This enforces the
constraint that tokens outside the controlling conditional block
invalidate the optimization.

   The 'do_if', when appropriate, and 'do_ifndef' directive handlers
pass the controlling macro to the function 'push_conditional'.  cpplib
maintains a stack of nested conditional blocks, and after processing
every opening conditional this function pushes an 'if_stack' structure
onto the stack.  In this structure it records the controlling macro for
the block, provided there is one and we're at top-of-file (as described
above).  If an '#elif' or '#else' directive is encountered, the
controlling macro for that block is cleared to 'NULL'.  Otherwise, it
survives until the '#endif' closing the block, upon which 'do_endif'
sets 'mi_valid' to true and stores the controlling macro in 'mi_cmacro'.

   '_cpp_handle_directive' clears 'mi_valid' when processing any
directive other than an opening conditional and the null directive.
With this, and requiring top-of-file to record a controlling macro, and
no '#else' or '#elif' for it to survive and be copied to 'mi_cmacro' by
'do_endif', we have enforced the absence of directives outside the main
conditional block for the optimization to be on.

   Note that whilst we are inside the conditional block, 'mi_valid' is
likely to be reset to 'false', but this does not matter since the
closing '#endif' restores it to 'true' if appropriate.

   Finally, since '_cpp_lex_direct' pops the file off the buffer stack
at 'EOF' without returning a token, if the '#endif' directive was not
followed by any tokens, 'mi_valid' is 'true' and '_cpp_pop_file_buffer'
remembers the controlling macro associated with the file.  Subsequent
calls to 'stack_include_file' result in no buffer being pushed if the
controlling macro is defined, effecting the optimization.
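
   The bookkeeping described in the last few paragraphs can be followed
in a small line-based model.  The Python below reuses the names
'mi_valid', 'mi_cmacro' and 'if_stack' from the text, but it is a
sketch under heavy assumptions rather than cpplib's code: the function
'controlling_macro' is hypothetical, comments are not handled, only the
'#ifndef NAME' form is recognised as a guard, and restoring the state
only at the outermost '#endif' is part of the model.

```python
# Toy model (not cpplib code) of the multiple-include bookkeeping.

def controlling_macro(lines):
    """Return the guard macro if the file qualifies, else None."""
    mi_valid, mi_cmacro = True, None  # state at top-of-file
    if_stack = []                     # one entry per open conditional
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                  # whitespace never invalidates
        if not line.startswith("#"):
            mi_valid = False          # an ordinary token
            continue
        words = line[1:].split()
        if not words:
            continue                  # the null directive is permitted
        name = words[0]
        if name in ("if", "ifdef", "ifndef"):
            top_of_file = mi_valid and mi_cmacro is None
            guarded = (name == "ifndef" and len(words) == 2
                       and top_of_file)
            if_stack.append(words[1] if guarded else None)
            continue                  # opening conditionals keep mi_valid
        mi_valid = False              # any other directive clears it
        if name in ("else", "elif") and if_stack:
            if_stack[-1] = None       # the block loses its guard
        elif name == "endif" and if_stack:
            cmacro = if_stack.pop()
            if cmacro is not None and not if_stack:
                mi_valid, mi_cmacro = True, cmacro
    return mi_cmacro if mi_valid else None


guarded_header = ["#ifndef FOO", "#define FOO", "int x;", "#endif", ""]
guard = controlling_macro(guarded_header)
```

   A token after the final '#endif', an '#else' at the outer level, or
a directive before the opening '#ifndef' each leave the result 'None',
matching requirements 1, 2 and 5 above.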

   A quick word on how we handle the

     #if !defined FOO

case.  '_cpp_parse_expr' and 'parse_defined' take steps to see whether
the three stages '!', 'defined-expression' and 'end-of-directive' occur
in order in a '#if' expression.  If so, they return the guard macro to
'do_if' in the variable 'mi_ind_cmacro', and otherwise set it to 'NULL'.
'enter_macro_context' sets 'mi_valid' to false, so if a macro was
expanded whilst parsing any part of the expression, then the top-of-file
test in 'push_conditional' fails and the optimization is turned off.


File: cppinternals.info,  Node: Files,  Next: Concept Index,  Prev: Guard Macros,  Up: Top

File Handling
*************

Fairly obviously, the file handling code of cpplib resides in the file
'files.cc'.  It takes care of the details of file searching, opening,
reading and caching, for both the main source file and all the headers
it recursively includes.

   The basic strategy is to minimize the number of system calls.  On
many systems, the basic 'open ()' and 'fstat ()' system calls can be
quite expensive.  For every '#include'-d file, we need to try all the
directories in the search path until we find a match.  Some projects,
such as glibc, pass twenty or thirty include paths on the command line,
so this can rapidly become time consuming.

   For a header file we have not encountered before we have little
choice but to do this.  However, it is often the case that the same
headers are repeatedly included, and in these cases we try to avoid
repeating the filesystem queries whilst searching for the correct file.

   For each file we try to open, we store the constructed path in a
splay tree.  This path first undergoes simplification by the function
'_cpp_simplify_pathname'.  For example, '/usr/include/bits/../foo.h' is
simplified to '/usr/include/foo.h' before we enter it in the splay tree
and try to 'open ()' the file.  CPP will then find subsequent uses of
'foo.h', even as '/usr/include/foo.h', in the splay tree and save system
calls.
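
   The effect of simplifying a path before using it as a cache key can
be sketched as below.  This Python is illustrative only and makes
several substitutions that should be read as assumptions: a dictionary
replaces the splay tree, 'posixpath.normpath' (a purely lexical
simplification that ignores symbolic links) stands in for
'_cpp_simplify_pathname', and the 'opener' callable stands in for the
real system calls.

```python
# Toy model (not files.cc): cache lookups keyed by a simplified path.
import posixpath


class FileCache:
    def __init__(self, opener):
        self.opener = opener   # stands in for open ()/fstat ()
        self.entries = {}      # simplified path -> cached result
        self.system_calls = 0

    def lookup(self, path):
        key = posixpath.normpath(path)  # lexical simplification only
        if key not in self.entries:
            self.system_calls += 1
            # A failed open (None) is cached too, so it is not retried.
            self.entries[key] = self.opener(key)
        return self.entries[key]


opened = []


def fake_open(path):
    opened.append(path)
    return "contents of " + path


cache = FileCache(fake_open)
first = cache.lookup("/usr/include/bits/../foo.h")
second = cache.lookup("/usr/include/foo.h")
```

   Both spellings resolve to one entry, so the second lookup costs no
further call to 'fake_open'.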

   Further, it is likely the file contents have also been cached, saving
a 'read ()' system call.  We don't bother caching the contents of header
files that are re-inclusion protected, and whose re-inclusion macro is
defined when we leave the header file for the first time.  If the host
supports it, we try to map suitably large files into memory, rather than
reading them in directly.

   The include paths are internally stored on a null-terminated
singly-linked list, starting with the '"header.h"' directory search
chain, which then links into the '<header.h>' directory chain.

   Files included with the '<foo.h>' syntax start the lookup directly in
the second half of this chain.  However, files included with the
'"foo.h"' syntax start at the beginning of the chain, but with one extra
directory prepended.  This is the directory of the current file, the one
containing the '#include' directive.  Prepending this directory on a
per-file basis is handled by the function 'search_from'.

   Note that a header included with a directory component, such as
'#include "mydir/foo.h"' and opened as '/usr/local/include/mydir/foo.h',
will have the complete path minus the basename 'foo.h' as the current
directory.
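
   The lookup order described above can be sketched in a few lines of
Python.  This is a model of the search order only, not of
'search_from' or of cpplib's linked list: the functions 'search_dirs'
and 'find', the two Python lists standing in for the two halves of the
chain, and the example paths are all assumptions.

```python
# Toy model (not files.cc) of the include search order.
import posixpath


def search_dirs(kind, including_file, quote_chain, bracket_chain):
    """Directories to try, in order, for one '#include'."""
    if kind == "angle":              # <foo.h>: second half of the chain
        return list(bracket_chain)
    # "foo.h": directory of the including file, then the whole chain.
    current_dir = posixpath.dirname(including_file)
    return [current_dir] + list(quote_chain) + list(bracket_chain)


def find(name, kind, including_file, quote_chain, bracket_chain, exists):
    for directory in search_dirs(kind, including_file,
                                 quote_chain, bracket_chain):
        candidate = posixpath.join(directory, name)
        if exists(candidate):
            return candidate
    return None


files = {"/proj/src/foo.h", "/usr/include/foo.h"}
quoted = find("foo.h", "quote", "/proj/src/main.c",
              ["/proj/inc"], ["/usr/include"], files.__contains__)
angled = find("foo.h", "angle", "/proj/src/main.c",
              ["/proj/inc"], ["/usr/include"], files.__contains__)
# A header opened as /usr/local/include/mydir/foo.h searches from its
# own directory first when it includes something with quotes.
nested = search_dirs("quote", "/usr/local/include/mydir/foo.h",
                     [], ["/usr/include"])
```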
9796fac5056Sskrll
9806fac5056Sskrll   Enough information is stored in the splay tree that CPP can
9816fac5056Sskrllimmediately tell whether it can skip the header file because of the
98296d60fd4Smrgmultiple include optimization, whether the file didn't exist or couldn't
98396d60fd4Smrgbe opened for some reason, or whether the header was flagged not to be
98496d60fd4Smrgre-used, as it is with the obsolete '#import' directive.

   For the benefit of MS-DOS filesystems with an 8.3 filename
limitation, CPP offers the ability to treat various include file names
as aliases for the real header files with shorter names.  The map from
one to the other is found in a special file called 'header.gcc', stored
in each command-line (or system) include directory to which the mapping
applies.  That directory may be higher up the directory tree than the
full path to the file minus the base name.
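
   The aliasing step itself amounts to a table lookup, sketched below
in C.  The sketch deliberately says nothing about the on-disk format
of 'header.gcc'; the in-memory table, the names 'name_map' and
'map_header_name', and any file names passed to it are all
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one alias from a name map.  */
struct name_map
{
  const char *alias;      /* name as written in the '#include' */
  const char *real_name;  /* shorter name that exists on disk */
};

/* Return the real file name for NAME, or NAME itself when the map
   of COUNT entries holds no alias for it.  */
const char *
map_header_name (const struct name_map *map, size_t count,
                 const char *name)
{
  size_t i;

  for (i = 0; i < count; i++)
    if (strcmp (map[i].alias, name) == 0)
      return map[i].real_name;
  return name;
}
```

   Falling back to the original name keeps headers that need no alias
working unchanged.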


File: cppinternals.info,  Node: Concept Index,  Prev: Files,  Up: Top

Concept Index
*************

[index]
* Menu:

* assertions:                            Hash Nodes.          (line   6)
* controlling macros:                    Guard Macros.        (line   6)
* escaped newlines:                      Lexer.               (line   5)
* files:                                 Files.               (line   6)
* guard macros:                          Guard Macros.        (line   6)
* hash table:                            Hash Nodes.          (line   6)
* header files:                          Conventions.         (line   6)
* identifiers:                           Hash Nodes.          (line   6)
* interface:                             Conventions.         (line   6)
* lexer:                                 Lexer.               (line   6)
* line numbers:                          Line Numbering.      (line   5)
* macro expansion:                       Macro Expansion.     (line   6)
* macro representation (internal):       Macro Expansion.     (line  19)
* macros:                                Hash Nodes.          (line   6)
* multiple-include optimization:         Guard Macros.        (line   6)
* named operators:                       Hash Nodes.          (line   6)
* newlines:                              Lexer.               (line   6)
* paste avoidance:                       Token Spacing.       (line   6)
* spacing:                               Token Spacing.       (line   6)
* token run:                             Lexer.               (line 191)
* token spacing:                         Token Spacing.       (line   6)



Tag Table:
Node: Top905
Node: Conventions2743
Node: Lexer3685
Ref: Invalid identifiers11600
Ref: Lexing a line13550
Node: Hash Nodes18319
Node: Macro Expansion21198
Node: Token Spacing30142
Node: Line Numbering35998
Node: Guard Macros40084
Node: Files44875
Node: Concept Index48342

End Tag Table


Local Variables:
coding: utf-8
End: