\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo
@dircategory Software development
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software
Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@title Cpplib Internals
@versionsubtitle
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001, 2002, 2004, 2005
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@contents
@page

@ifnottex
@node Top
@top
@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is implemented as a library, @dfn{cpplib}, so it
can be easily shared between a stand-alone preprocessor, and a
preprocessor integrated with the C, C++ and Objective-C front ends.  It
is also available for use by other programs, though this is not
recommended as its exposed interface has not yet reached a point of
reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some
of the tricky issues.  It is intended that, along with the comments in
the source code, a reasonably competent C programmer should be able to
figure out what the code is doing, and why things have been implemented
the way they have.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.
@end menu
@end ifnottex

@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces---one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @file{internal.h}.  Functions and types exposed to external
clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.  For
historical reasons this is no longer quite true, but we should strive to
stick to it.

We are striving to reduce the information exposed in @file{cpplib.h} to the
bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.

@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines

@section Overview
The lexer is contained in the file @file{lex.c}.  It is a hand-coded
lexer, and not implemented as a state machine.  It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.

@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines.  In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor r@^ole to play in memory
management of lexed lines.  I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @code{cur_token}, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@code{line} and @code{col} values of the token just before the location
that @code{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags.  Each token has its
@code{line} and @code{col} variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.
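
As a minimal sketch of this bookkeeping (an illustration only, not
cpplib's code), the toy lexer below records preceding whitespace as a
flag on the next token instead of producing a whitespace token.  The
names @code{toy_token}, @code{TOY_PREV_WHITE} and @code{toy_lex} are
hypothetical; tokens are simply runs of non-blank characters, and a tab
is counted as a single column.

@smallexample
#include <stddef.h>

#define TOY_PREV_WHITE 1u   /* hypothetical stand-in for PREV_WHITE */

struct toy_token
@{
  const char *start;   /* first character of the token */
  size_t len;          /* length in characters */
  unsigned col;        /* 1-based column of the first character */
  unsigned flags;      /* TOY_PREV_WHITE if whitespace came before */
@};

/* Lex the next token of a single line, starting at *P.  Returns 1 and
   fills in TOK on success, or 0 at the end of the line.  */
int
toy_lex (const char **p, const char *line_start, struct toy_token *tok)
@{
  const char *s = *p;

  tok->flags = 0;
  while (*s == ' ' || *s == '\t')
    @{
      tok->flags |= TOY_PREV_WHITE;   /* a flag, not a token */
      s++;
    @}
  if (*s == '\0' || *s == '\n')
    return 0;

  tok->start = s;
  tok->col = (unsigned) (s - line_start) + 1;
  while (*s != '\0' && *s != '\n' && *s != ' ' && *s != '\t')
    s++;
  tok->len = (size_t) (s - tok->start);
  *p = s;
  return 1;
@}
@end smallexample

@noindent Lexing the line @samp{a  bc} with this sketch gives two
tokens; only the second has the flag set, and its column is 4.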

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set.  The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive.  In a directive a @code{CPP_EOF} token never means
end-of-file.  Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This whitespace is
important in case the macro argument is stringified.  The state variable
@code{parsing_args} is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like

@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator.  This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.
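
The rule can be sketched in a few lines of C.  This is an illustration
under simplifying assumptions, not the code in @file{lex.c}; the names
@code{toy_newline_len} and @code{toy_count_lines} are hypothetical.  A
newline character followed immediately by the @emph{other} newline
character is consumed as one indicator.

@smallexample
#include <stddef.h>

/* If S points at a new line indicator, return its length in
   characters: 1 for "\r" or "\n", 2 for "\r\n" or "\n\r".
   Otherwise return 0.  */
size_t
toy_newline_len (const char *s)
@{
  if (s[0] != '\n' && s[0] != '\r')
    return 0;
  /* A different newline character right after completes a pair.  */
  if ((s[1] == '\n' || s[1] == '\r') && s[1] != s[0])
    return 2;
  return 1;
@}

/* Count the new line indicators in a NUL-terminated buffer.  */
unsigned
toy_count_lines (const char *s)
@{
  unsigned n = 0;

  while (*s != '\0')
    @{
      size_t len = toy_newline_len (s);
      if (len != 0)
        @{
          n++;
          s += len;
        @}
      else
        s++;
    @}
  return n;
@}
@end smallexample

@noindent With this sketch, a buffer that mixes all four spellings,
such as @samp{a\r\nb\nc\rd\n\re}, is counted as four lines.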

We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars.  Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only.  The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminate a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline.  If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly, code in the main body of @code{_cpp_lex_direct} cannot simply
check for an @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.
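
The idea behind such a helper can be sketched as follows.  This is a
simplified illustration rather than the real @code{get_effective_char}:
the names @code{toy_escaped_newline_len} and @code{toy_effective_char}
are hypothetical, only @samp{\n} is accepted as the newline, and the
whitespace-before-newline extension described above is left out.

@smallexample
#include <stddef.h>

/* Return the length of an escaped newline at S, or 0 if there is
   none.  The escape is either a backslash or the trigraph spelling
   of a backslash, followed directly by "\n".  */
size_t
toy_escaped_newline_len (const char *s)
@{
  size_t n;

  if (s[0] == '\\')
    n = 1;
  else if (s[0] == '?' && s[1] == '?' && s[2] == '/')
    n = 3;
  else
    return 0;
  return s[n] == '\n' ? n + 1 : 0;
@}

/* Skip any run of escaped newlines at *P, then return the next
   character and step past it.  Returns '\0' at the end of the
   buffer without stepping past it.  */
int
toy_effective_char (const char **p)
@{
  size_t n;

  while ((n = toy_escaped_newline_len (*p)) != 0)
    *p += n;
  return **p != '\0' ? *(*p)++ : '\0';
@}
@end smallexample

@noindent A caller that has just read @samp{+} can then ask for the
next effective character and compare it with @samp{=}, however many
escaped newlines lie between the two.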

The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
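
A minimal sketch of the column arithmetic follows.  It assumes 1-based
columns and tab stops at fixed intervals, which is one common
convention; cpplib's own bookkeeping may differ in detail, and the
names @code{toy_next_col} and @code{toy_column_of} are hypothetical.

@smallexample
/* Return the column that follows character C, given that C sits in
   1-based column COL and tab stops occur every TABSTOP columns.  */
unsigned
toy_next_col (unsigned col, char c, unsigned tabstop)
@{
  if (c == '\t')
    /* Jump to the first column after the next tab stop.  */
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
@}

/* Return the 1-based column of POS within the line starting at
   LINE.  Every character is counted, including those inside
   comments.  */
unsigned
toy_column_of (const char *line, const char *pos, unsigned tabstop)
@{
  unsigned col = 1;

  for (; line < pos; line++)
    col = toy_next_col (col, *line, tabstop);
  return col;
@}
@end smallexample

@noindent For the line @samp{ab<TAB>c}, the @samp{c} is reported in
column 9 with a tab stop of 8, and in column 5 with a tab stop of 4.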

@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a @samp{<} would be lexed as a single
token.  After a @code{#include} directive, though, it should be lexed as
the start of a header name: a single token extending as far as the
nearest @samp{>} character.  Note that we don't allow the terminators of
header names to be escaped; the first @samp{"} or @samp{>} terminates
the header name.
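
The effect of such a state flag can be sketched as below.  The function
@code{toy_lex_angle} and its @code{angled_headers} parameter are
hypothetical, the input is assumed to hold a single line, and falling
back to a lone @samp{<} when no @samp{>} is found is a choice made for
this sketch.

@smallexample
#include <stddef.h>
#include <string.h>

/* S points at a '<'.  Return the length of the token that starts
   there.  When ANGLED_HEADERS is nonzero, as a lexer would arrange
   after #include, the token is a header name running up to the
   first '>'.  Otherwise it is the one-character '<' operator.  */
size_t
toy_lex_angle (const char *s, int angled_headers)
@{
  if (angled_headers)
    @{
      /* No escapes: the first '>' always ends the name.  */
      const char *end = strchr (s + 1, '>');
      if (end != NULL)
        return (size_t) (end - s) + 1;
    @}
  return 1;
@}
@end smallexample

@noindent With the flag set, @samp{<stdio.h>} is lexed as one token of
nine characters; with it clear, the same text starts with a
one-character @samp{<} token.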

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error.  Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.
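
A minimal sketch of such a language-dependent decision follows.  The
structure @code{toy_options} and the function @code{toy_lex_colon} are
hypothetical stand-ins; only the idea of consulting an option flag
while lexing is taken from the text above.

@smallexample
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of cpp_options.  */
struct toy_options
@{
  int cplusplus;   /* nonzero when lexing C++ */
@};

/* S points at a ':'.  Return the length of the token that starts
   there under the language selected by OPTS.  */
size_t
toy_lex_colon (const char *s, const struct toy_options *opts)
@{
  if (s[1] == ':' && opts->cplusplus)
    return 2;   /* the C++ scope operator */
  return 1;     /* a lone ':' */
@}
@end smallexample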

Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls @code{cpp_destroy}, probably at the end of the
translation unit.

@anchor{Lexing a line}
@section Lexing a line
@cindex token run

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run.  Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.
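
The chaining scheme can be sketched as follows.  This is an
illustration of the technique under simplifying assumptions, not
cpplib's data structure: all of the @code{toy_} names are hypothetical,
the run size is kept tiny to make overflow easy to trigger, and freeing
the runs is omitted.

@smallexample
#include <stdlib.h>

#define TOY_RUN_SIZE 4   /* tiny, so that overflow is easy to see */

struct toy_token @{ int value; @};

struct toy_run
@{
  struct toy_token tokens[TOY_RUN_SIZE];
  struct toy_run *next;    /* run chained on after overflow */
@};

struct toy_lexer
@{
  struct toy_run *head;    /* first run of the chain */
  struct toy_run *cur;     /* run currently being filled */
  size_t used;             /* slots used in CUR */
@};

/* Return storage for one more token.  On overflow, move to the next
   run, chaining on a new one if needed.  Existing runs are never
   reallocated, so earlier pointers stay valid.  Returns NULL if
   memory runs out.  */
struct toy_token *
toy_next_slot (struct toy_lexer *lx)
@{
  if (lx->cur == NULL || lx->used == TOY_RUN_SIZE)
    @{
      struct toy_run *next = lx->cur != NULL ? lx->cur->next : NULL;
      if (next == NULL)
        @{
          next = calloc (1, sizeof *next);
          if (next == NULL)
            return NULL;
          if (lx->cur != NULL)
            lx->cur->next = next;
          else
            lx->head = next;
        @}
      lx->cur = next;
      lx->used = 0;
    @}
  return &lx->cur->tokens[lx->used++];
@}

/* At an unescaped new line: start filling from the first run again,
   reusing the chain that already exists.  */
void
toy_rewind (struct toy_lexer *lx)
@{
  lx->cur = lx->head;
  lx->used = 0;
@}
@end smallexample

@noindent A pointer obtained from @code{toy_next_slot} stays valid
however many further tokens are requested before the next call to
@code{toy_rewind}, which is the guarantee described above.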

The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
for stringification and token pasting.  I handled this by allocating
space for these tokens from the lexer's token run chain.  This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management
issues, but not all.  The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line.  So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero.  Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro.  In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.
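
The role of the counter can be sketched like this.  Only the name
@code{keep_tokens} comes from the text above; the structure
@code{toy_reader} and the three functions are hypothetical, and the
token run is reduced to a single index for brevity.

@smallexample
struct toy_reader
@{
  unsigned keep_tokens;   /* nonzero: earlier tokens must survive */
  unsigned next_slot;     /* index of the next free token slot */
@};

/* Called at each unescaped new line.  The run is reused from its
   start only when nobody has asked for tokens to be kept.  */
void
toy_end_of_line (struct toy_reader *r)
@{
  if (r->keep_tokens == 0)
    r->next_slot = 0;
@}

/* Bracket a lookahead that may cross lines, such as the search for
   the '(' that follows a function-like macro name.  */
void
toy_begin_lookahead (struct toy_reader *r)
@{
  r->keep_tokens++;
@}

void
toy_end_lookahead (struct toy_reader *r)
@{
  r->keep_tokens--;
@}
@end smallexample

@noindent Using a counter rather than a boolean lets nested requests
to keep tokens compose: the run is rewound only once every request has
been released.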

The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.

@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an ``identifier'', it generates a hash code for
it and stores it in the hash table.  By ``identifier'' we mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on.  For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.
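
The following sketch shows the general technique of interning
identifiers so that each distinct spelling gets exactly one node.  It
is not cpplib's hash table: the @code{toy_} names, the hash function
and the use of chained buckets are all choices made for this
illustration.

@smallexample
#include <stdlib.h>
#include <string.h>

#define TOY_BUCKETS 64

struct toy_node
@{
  char *name;               /* NUL-terminated spelling */
  size_t len;
  struct toy_node *next;    /* next node in the same bucket */
@};

static struct toy_node *toy_table[TOY_BUCKETS];

static unsigned
toy_hash (const char *s, size_t len)
@{
  unsigned h = 0;

  while (len-- != 0)
    h = h * 31u + (unsigned char) *s++;
  return h % TOY_BUCKETS;
@}

/* Return the unique node for the LEN characters at S, creating it
   the first time the spelling is seen.  Returns NULL if memory
   runs out.  */
struct toy_node *
toy_lookup (const char *s, size_t len)
@{
  unsigned h = toy_hash (s, len);
  struct toy_node *n;

  for (n = toy_table[h]; n != NULL; n = n->next)
    if (n->len == len && memcmp (n->name, s, len) == 0)
      return n;

  n = malloc (sizeof *n);
  if (n == NULL)
    return NULL;
  n->name = malloc (len + 1);
  if (n->name == NULL)
    @{
      free (n);
      return NULL;
    @}
  memcpy (n->name, s, len);
  n->name[len] = '\0';
  n->len = len;
  n->next = toy_table[h];
  toy_table[h] = n;
  return n;
@}
@end smallexample

@noindent Two lookups of @samp{foo} return the same pointer, so later
stages can compare identifiers by comparing node addresses.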
422*e4b17023SJohn Marino
423*e4b17023SJohn MarinoEach node in the hash table contain various information about the
424*e4b17023SJohn Marinoidentifier it represents.  For example, its length and type.  At any one
425*e4b17023SJohn Marinotime, each identifier falls into exactly one of three categories:
426*e4b17023SJohn Marino
427*e4b17023SJohn Marino@itemize @bullet
428*e4b17023SJohn Marino@item Macros
429*e4b17023SJohn Marino
430*e4b17023SJohn MarinoThese have been declared to be macros, either on the command line or
431*e4b17023SJohn Marinowith @code{#define}.  A few, such as @code{__TIME__} are built-ins
432*e4b17023SJohn Marinoentered in the hash table during initialization.  The hash node for a
433*e4b17023SJohn Marinonormal macro points to a structure with more information about the
434*e4b17023SJohn Marinomacro, such as whether it is function-like, how many arguments it takes,
435*e4b17023SJohn Marinoand its expansion.  Built-in macros are flagged as special, and instead
436*e4b17023SJohn Marinocontain an enum indicating which of the various built-in macros it is.
437*e4b17023SJohn Marino
438*e4b17023SJohn Marino@item Assertions
439*e4b17023SJohn Marino
440*e4b17023SJohn MarinoAssertions are in a separate namespace to macros.  To enforce this, cpp
441*e4b17023SJohn Marinoactually prepends a @code{#} character before hashing and entering it in
442*e4b17023SJohn Marinothe hash table.  An assertion's node points to a chain of answers to
443*e4b17023SJohn Marinothat assertion.
444*e4b17023SJohn Marino
445*e4b17023SJohn Marino@item Void
446*e4b17023SJohn Marino
447*e4b17023SJohn MarinoEverything else falls into this category---an identifier that is not
448*e4b17023SJohn Marinocurrently a macro, or a macro that has since been undefined with
449*e4b17023SJohn Marino@code{#undef}.
450*e4b17023SJohn Marino
451*e4b17023SJohn MarinoWhen preprocessing C++, this category also includes the named operators,
452*e4b17023SJohn Marinosuch as @code{xor}.  In expressions these behave like the operators they
453*e4b17023SJohn Marinorepresent, but in contexts where the spelling of a token matters they
454*e4b17023SJohn Marinoare spelt differently.  This spelling distinction is relevant when they
455*e4b17023SJohn Marinoare operands of the stringizing and pasting macro operators @code{#} and
456*e4b17023SJohn Marino@code{##}.  Named operator hash nodes are flagged, both to catch the
457*e4b17023SJohn Marinospelling distinction and to prevent them from being defined as macros.
458*e4b17023SJohn Marino@end itemize
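
The three categories can be pictured with a small sketch.  It is
illustrative only: the names @code{sk_node}, @code{SK_VOID} and so on are
assumptions made for this example, and the layout does not reproduce
cpplib's real hash node structure.

```c
#include <stddef.h>

/* Hypothetical node categories; exactly one applies at any time.  */
enum sk_node_type { SK_VOID, SK_MACRO, SK_ASSERTION };

/* Hypothetical flags kept alongside the category.  */
#define SK_BUILTIN        0x1  /* special built-in macro, e.g. __TIME__ */
#define SK_NAMED_OPERATOR 0x2  /* C++ named operator, e.g. xor */

struct sk_node
{
  const char *name;         /* spelling of the identifier */
  size_t length;            /* length of the spelling */
  enum sk_node_type type;   /* current category */
  unsigned flags;
};

/* Try to turn NODE into a macro.  Returns 1 on success, or 0 if the node
   is a named operator, which may not be defined as a macro.  */
static int
sk_define (struct sk_node *node)
{
  if (node->flags & SK_NAMED_OPERATOR)
    return 0;
  node->type = SK_MACRO;
  return 1;
}

/* #undef returns a macro's identifier to the void category.  */
static void
sk_undef (struct sk_node *node)
{
  if (node->type == SK_MACRO)
    node->type = SK_VOID;
}
```

Calling @code{sk_define} and then @code{sk_undef} on the same node moves
it from void to macro and back again, which is the life cycle described
in the list above.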
459*e4b17023SJohn Marino
All occurrences of an identifier share a single hash node.  Since each
identifier token, after lexing, contains a pointer to its hash node,
the node provides rapid lookup of various information.  For example,
when parsing a @code{#define} directive, CPP flags each parameter's
identifier hash node with the index of that parameter.  This makes
checking for duplicated parameters an O(1) operation for each
parameter.  Similarly, for each identifier in the macro's expansion,
looking up whether it is a parameter, and which one, is also an O(1)
operation.  Further, each directive name, such as @code{endif}, has an
associated directive enum stored in its hash node, so that directive
lookup is also O(1).
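
As a rough sketch of the duplicate check, assume a toy hash table in
which each node carries a 1-based parameter index, with zero meaning
``not a parameter''.  The names @code{sk_ident}, @code{sk_lookup} and
@code{sk_save_parameter} are hypothetical and exist only for this
example.

```c
#include <string.h>

#define SK_TABLE_SIZE 64

/* Hypothetical hash node: ARG_INDEX is 0 when the identifier is not a
   parameter of the macro being defined, else its 1-based position.  */
struct sk_ident
{
  const char *name;
  unsigned arg_index;
};

static struct sk_ident sk_table[SK_TABLE_SIZE];

/* Return the unique node for NAME, creating it on first use.  A toy
   open-addressing table; assumes fewer than SK_TABLE_SIZE identifiers
   and that NAME outlives the table.  */
static struct sk_ident *
sk_lookup (const char *name)
{
  unsigned h = 0;
  const char *p;

  for (p = name; *p; p++)
    h = h * 31 + (unsigned char) *p;
  for (;; h++)
    {
      struct sk_ident *node = &sk_table[h % SK_TABLE_SIZE];
      if (node->name == NULL)
        {
          node->name = name;
          return node;
        }
      if (strcmp (node->name, name) == 0)
        return node;
    }
}

/* Record NAME as parameter number INDEX (1-based).  Returns 0 if NAME
   was already a parameter; no search of the parameter list is needed.  */
static int
sk_save_parameter (const char *name, unsigned index)
{
  struct sk_ident *node = sk_lookup (name);

  if (node->arg_index != 0)
    return 0;
  node->arg_index = index;
  return 1;
}
```

With this arrangement, a second attempt to save @code{x} as a parameter
is rejected by inspecting one field of one node, however many
parameters the macro has.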
470*e4b17023SJohn Marino
471*e4b17023SJohn Marino@node Macro Expansion
472*e4b17023SJohn Marino@unnumbered Macro Expansion Algorithm
473*e4b17023SJohn Marino@cindex macro expansion
474*e4b17023SJohn Marino
475*e4b17023SJohn MarinoMacro expansion is a tricky operation, fraught with nasty corner cases
476*e4b17023SJohn Marinoand situations that render what you thought was a nifty way to
477*e4b17023SJohn Marinooptimize the preprocessor's expansion algorithm wrong in quite subtle
478*e4b17023SJohn Marinoways.
479*e4b17023SJohn Marino
I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this
section, let alone the code!  If you don't have a clear mental
picture of how things like nested macro expansion, stringification and
token pasting are supposed to work, damage to your sanity can quickly
result.
486*e4b17023SJohn Marino
487*e4b17023SJohn Marino@section Internal representation of macros
488*e4b17023SJohn Marino@cindex macro representation (internal)
489*e4b17023SJohn Marino
490*e4b17023SJohn MarinoThe preprocessor stores macro expansions in tokenized form.  This
491*e4b17023SJohn Marinosaves repeated lexing passes during expansion, at the cost of a small
increase in memory consumption on average.  The tokens are stored
contiguously in memory, so a pointer to the first one and a token
count are all you need to get the replacement list of a macro.
495*e4b17023SJohn Marino
496*e4b17023SJohn MarinoIf the macro is a function-like macro the preprocessor also stores its
497*e4b17023SJohn Marinoparameters, in the form of an ordered list of pointers to the hash
498*e4b17023SJohn Marinotable entry of each parameter's identifier.  Further, in the macro's
499*e4b17023SJohn Marinostored expansion each occurrence of a parameter is replaced with a
500*e4b17023SJohn Marinospecial token of type @code{CPP_MACRO_ARG}.  Each such token holds the
501*e4b17023SJohn Marinoindex of the parameter it represents in the parameter list, which
502*e4b17023SJohn Marinoallows rapid replacement of parameters with their arguments during
503*e4b17023SJohn Marinoexpansion.  Despite this optimization it is still necessary to store
504*e4b17023SJohn Marinothe original parameters to the macro, both for dumping with e.g.,
505*e4b17023SJohn Marino@option{-dD}, and to warn about non-trivial macro redefinitions when
506*e4b17023SJohn Marinothe parameter names have changed.
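
The following sketch shows one way such a representation could look.
It is a simplification under stated assumptions: the type and function
names are invented for the example, and each argument is limited to a
single token, which real macro arguments are not.

```c
#include <stddef.h>

enum sk_ttype { SK_NAME, SK_OTHER, SK_MACRO_ARG };

/* A toy token.  For SK_MACRO_ARG, ARG_NO is the 1-based parameter index;
   otherwise SPELLING holds the token text.  */
struct sk_token
{
  enum sk_ttype type;
  const char *spelling;
  unsigned arg_no;
};

/* A toy function-like macro.  */
struct sk_macro
{
  const struct sk_token *exp;   /* first token of the replacement list */
  unsigned count;               /* number of replacement tokens */
  const char *const *params;    /* ordered parameter names, kept for
                                   dumping and redefinition checks */
  unsigned paramc;              /* number of parameters */
};

/* Copy MACRO's replacement list into OUT, replacing each SK_MACRO_ARG
   token with the single-token argument ARGS[arg_no - 1].  Returns the
   number of tokens written.  */
static unsigned
sk_substitute (const struct sk_macro *macro, const struct sk_token *args,
               struct sk_token *out)
{
  unsigned i;

  for (i = 0; i < macro->count; i++)
    {
      const struct sk_token *t = &macro->exp[i];
      out[i] = (t->type == SK_MACRO_ARG) ? args[t->arg_no - 1] : *t;
    }
  return macro->count;
}
```

Because each @code{SK_MACRO_ARG} token already holds its index,
substitution never compares parameter names; the names are only
consulted for the secondary purposes mentioned above.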
507*e4b17023SJohn Marino
508*e4b17023SJohn Marino@section Macro expansion overview
509*e4b17023SJohn MarinoThe preprocessor maintains a @dfn{context stack}, implemented as a
510*e4b17023SJohn Marinolinked list of @code{cpp_context} structures, which together represent
511*e4b17023SJohn Marinothe macro expansion state at any one time.  The @code{struct
512*e4b17023SJohn Marinocpp_reader} member variable @code{context} points to the current top
513*e4b17023SJohn Marinoof this stack.  The top normally holds the unexpanded replacement list
514*e4b17023SJohn Marinoof the innermost macro under expansion, except when cpplib is about to
515*e4b17023SJohn Marinopre-expand an argument, in which case it holds that argument's
516*e4b17023SJohn Marinounexpanded tokens.
517*e4b17023SJohn Marino
518*e4b17023SJohn MarinoWhen there are no macros under expansion, cpplib is in @dfn{base
519*e4b17023SJohn Marinocontext}.  All contexts other than the base context contain a
520*e4b17023SJohn Marinocontiguous list of tokens delimited by a starting and ending token.
521*e4b17023SJohn MarinoWhen not in base context, cpplib obtains the next token from the list
522*e4b17023SJohn Marinoof the top context.  If there are no tokens left in the list, it pops
523*e4b17023SJohn Marinothat context off the stack, and subsequent ones if necessary, until an
524*e4b17023SJohn Marinounexhausted context is found or it returns to base context.  In base
525*e4b17023SJohn Marinocontext, cpplib reads tokens directly from the lexer.
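
The token-fetching loop just described might be sketched as follows.
Tokens are reduced to plain integers, and @code{sk_context},
@code{sk_reader} and @code{sk_get_token} are hypothetical names for this
example rather than cpplib's own definitions.

```c
#include <stddef.h>

/* Toy context: a contiguous run of tokens (here just ints) plus a link
   to the context below it.  PREV is NULL only for the base context.  */
struct sk_context
{
  struct sk_context *prev;
  const int *first;   /* next token to return */
  const int *last;    /* one past the final token */
};

struct sk_reader
{
  struct sk_context base;      /* base context; holds no tokens */
  struct sk_context *context;  /* current top of the stack */
  const int *lexer_pos;        /* stand-in for the lexer's input */
};

/* Return the next token: from the top context if it has tokens left,
   popping exhausted contexts as needed, else directly from the lexer.  */
static int
sk_get_token (struct sk_reader *r)
{
  for (;;)
    {
      struct sk_context *c = r->context;

      if (c->prev == NULL)
        return *r->lexer_pos++;   /* base context: read the lexer */
      if (c->first < c->last)
        return *c->first++;       /* unexhausted context */
      r->context = c->prev;       /* exhausted: pop and retry */
    }
}
```

Note that in this sketch a context is popped only when a later call
finds it empty, not at the moment its last token is handed out.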
526*e4b17023SJohn Marino
527*e4b17023SJohn MarinoIf it encounters an identifier that is both a macro and enabled for
528*e4b17023SJohn Marinoexpansion, cpplib prepares to push a new context for that macro on the
529*e4b17023SJohn Marinostack by calling the routine @code{enter_macro_context}.  When this
530*e4b17023SJohn Marinoroutine returns, the new context will contain the unexpanded tokens of
531*e4b17023SJohn Marinothe replacement list of that macro.  In the case of function-like
532*e4b17023SJohn Marinomacros, @code{enter_macro_context} also replaces any parameters in the
533*e4b17023SJohn Marinoreplacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
534*e4b17023SJohn Marinoappropriate macro argument.  If the standard requires that the
535*e4b17023SJohn Marinoparameter be replaced with its expanded argument, the argument will
536*e4b17023SJohn Marinohave been fully macro expanded first.
537*e4b17023SJohn Marino
538*e4b17023SJohn Marino@code{enter_macro_context} also handles special macros like
539*e4b17023SJohn Marino@code{__LINE__}.  Although these macros expand to a single token which
540*e4b17023SJohn Marinocannot contain any further macros, for reasons of token spacing
541*e4b17023SJohn Marino(@pxref{Token Spacing}) and simplicity of implementation, cpplib
542*e4b17023SJohn Marinohandles these special macros by pushing a context containing just that
543*e4b17023SJohn Marinoone token.
544*e4b17023SJohn Marino
545*e4b17023SJohn MarinoThe final thing that @code{enter_macro_context} does before returning
546*e4b17023SJohn Marinois to mark the macro disabled for expansion (except for special macros
547*e4b17023SJohn Marinolike @code{__TIME__}).  The macro is re-enabled when its context is
548*e4b17023SJohn Marinolater popped from the context stack, as described above.  This strict
549*e4b17023SJohn Marinoordering ensures that a macro is disabled whilst its expansion is
550*e4b17023SJohn Marinobeing scanned, but that it is @emph{not} disabled whilst any arguments
551*e4b17023SJohn Marinoto it are being expanded.
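
Both halves of this ordering can be observed from ordinary C code.  The
fragment below is a small demonstration, not part of cpplib; the macro
and variable names are made up for the example.

```c
/* Stringize after full macro expansion.  */
#define SK_STR(x) #x
#define SK_XSTR(x) SK_STR (x)

/* Self-referential object-like macro: the inner name is left alone
   because the macro is disabled while its own expansion is scanned.  */
#define sk_self (4 + sk_self)

/* A function-like macro is not disabled while its arguments are being
   pre-expanded, so the nested invocation below expands normally.  */
#define sk_inc(x) ((x) + 1)

/* Holds the expansion of sk_self, which still mentions sk_self.  */
static const char sk_self_text[] = SK_XSTR (sk_self);

/* Both the inner and the outer sk_inc are expanded.  */
static const int sk_nested = sk_inc (sk_inc (2));
```

The string keeps the unexpanded inner @code{sk_self}, while
@code{sk_nested} evaluates to 4 because the inner @code{sk_inc} was
expanded as an argument before the outer macro was disabled.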
552*e4b17023SJohn Marino
553*e4b17023SJohn Marino@section Scanning the replacement list for macros to expand
554*e4b17023SJohn MarinoThe C standard states that, after any parameters have been replaced
555*e4b17023SJohn Marinowith their possibly-expanded arguments, the replacement list is
556*e4b17023SJohn Marinoscanned for nested macros.  Further, any identifiers in the
557*e4b17023SJohn Marinoreplacement list that are not expanded during this scan are never
558*e4b17023SJohn Marinoagain eligible for expansion in the future, if the reason they were
559*e4b17023SJohn Marinonot expanded is that the macro in question was disabled.
560*e4b17023SJohn Marino
561*e4b17023SJohn MarinoClearly this latter condition can only apply to tokens resulting from
562*e4b17023SJohn Marinoargument pre-expansion.  Other tokens never have an opportunity to be
563*e4b17023SJohn Marinore-tested for expansion.  It is possible for identifiers that are
564*e4b17023SJohn Marinofunction-like macros to not expand initially but to expand during a
565*e4b17023SJohn Marinolater scan.  This occurs when the identifier is the last token of an
566*e4b17023SJohn Marinoargument (and therefore originally followed by a comma or a closing
567*e4b17023SJohn Marinoparenthesis in its macro's argument list), and when it replaces its
568*e4b17023SJohn Marinoparameter in the macro's replacement list, the subsequent token
569*e4b17023SJohn Marinohappens to be an opening parenthesis (itself possibly the first token
570*e4b17023SJohn Marinoof an argument).
571*e4b17023SJohn Marino
572*e4b17023SJohn MarinoIt is important to note that when cpplib reads the last token of a
573*e4b17023SJohn Marinogiven context, that context still remains on the stack.  Only when
574*e4b17023SJohn Marinolooking for the @emph{next} token do we pop it off the stack and drop
575*e4b17023SJohn Marinoto a lower context.  This makes backing up by one token easy, but more
576*e4b17023SJohn Marinoimportantly ensures that the macro corresponding to the current
577*e4b17023SJohn Marinocontext is still disabled when we are considering the last token of
578*e4b17023SJohn Marinoits replacement list for expansion (or indeed expanding it).  As an
579*e4b17023SJohn Marinoexample, which illustrates many of the points above, consider
580*e4b17023SJohn Marino
581*e4b17023SJohn Marino@smallexample
582*e4b17023SJohn Marino#define foo(x) bar x
583*e4b17023SJohn Marinofoo(foo) (2)
584*e4b17023SJohn Marino@end smallexample
585*e4b17023SJohn Marino
586*e4b17023SJohn Marino@noindent which fully expands to @samp{bar foo (2)}.  During pre-expansion
587*e4b17023SJohn Marinoof the argument, @samp{foo} does not expand even though the macro is
588*e4b17023SJohn Marinoenabled, since it has no following parenthesis [pre-expansion of an
589*e4b17023SJohn Marinoargument only uses tokens from that argument; it cannot take tokens
590*e4b17023SJohn Marinofrom whatever follows the macro invocation].  This still leaves the
591*e4b17023SJohn Marinoargument token @samp{foo} eligible for future expansion.  Then, when
592*e4b17023SJohn Marinore-scanning after argument replacement, the token @samp{foo} is
593*e4b17023SJohn Marinorejected for expansion, and marked ineligible for future expansion,
594*e4b17023SJohn Marinosince the macro is now disabled.  It is disabled because the
595*e4b17023SJohn Marinoreplacement list @samp{bar foo} of the macro is still on the context
596*e4b17023SJohn Marinostack.
597*e4b17023SJohn Marino
If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro was disabled, it would be subtly wrong.
In the example above, the replacement list of @samp{foo} would be
popped in the process of finding the parenthesis, re-enabling
@samp{foo} and expanding it a second time.
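
The example can be checked from C by stringizing the fully expanded
text.  This is only a demonstration harness: the helper macros and the
function @code{sk_same_ignoring_spaces} are inventions for this sketch,
and the comparison ignores spaces so that it does not depend on exactly
how a given preprocessor spaces its output.

```c
/* Stringize after full macro expansion.  */
#define SK_STR(x) #x
#define SK_XSTR(x) SK_STR (x)

#define foo(x) bar x

/* The fully expanded text of the example, as a string.  */
static const char sk_expanded[] = SK_XSTR (foo(foo) (2));

/* Compare two strings, ignoring space characters.  Returns 1 if they
   match and 0 otherwise.  */
static int
sk_same_ignoring_spaces (const char *a, const char *b)
{
  for (;;)
    {
      while (*a == ' ')
        a++;
      while (*b == ' ')
        b++;
      if (*a != *b)
        return 0;
      if (*a == '\0')
        return 1;
      a++;
      b++;
    }
}
```

Had @samp{foo} been re-enabled too early, the trailing @samp{foo (2)}
would have expanded again and the text would read @samp{bar bar 2}
instead.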
603*e4b17023SJohn Marino
604*e4b17023SJohn Marino@section Looking for a function-like macro's opening parenthesis
Function-like macros only expand when immediately followed by an
opening parenthesis.  To check for one, cpplib needs to temporarily
disable macros and read the next token.  Unfortunately, because of
spacing issues (@pxref{Token Spacing}), there can be fake padding tokens
in between, and if the next real token is not a parenthesis cpplib needs
to be able to back up that one token as well as retain the information
in any intervening padding tokens.
612*e4b17023SJohn Marino
613*e4b17023SJohn MarinoBacking up more than one token when macros are involved is not
614*e4b17023SJohn Marinopermitted by cpplib, because in general it might involve issues like
615*e4b17023SJohn Marinorestoring popped contexts onto the context stack, which are too hard.
616*e4b17023SJohn MarinoInstead, searching for the parenthesis is handled by a special
617*e4b17023SJohn Marinofunction, @code{funlike_invocation_p}, which remembers padding
618*e4b17023SJohn Marinoinformation as it reads tokens.  If the next real token is not an
619*e4b17023SJohn Marinoopening parenthesis, it backs up that one token, and then pushes an
620*e4b17023SJohn Marinoextra context just containing the padding information if necessary.
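
A much simplified sketch of this search follows.  It is not the real
@code{funlike_invocation_p}: the token stream is a plain array, the
names are hypothetical, and instead of pushing a context the sketch
hands the first padding token it skipped back to the caller.

```c
#include <stddef.h>

enum sk_kind { SK_PADDING, SK_OPEN_PAREN, SK_OTHER };

struct sk_tok
{
  enum sk_kind kind;
};

/* Toy token source over an array; POS indexes the next token.  */
struct sk_stream
{
  const struct sk_tok *toks;
  size_t pos, count;
};

/* Decide whether a function-like macro name is followed by `('.
   Returns 1 and consumes the parenthesis if so.  Otherwise backs up the
   one real token that was read, if any, and returns 0.  In both cases
   *PADDING is set to the first padding token skipped, or NULL if there
   was none, so the caller can preserve the spacing information.  */
static int
sk_funlike_invocation_p (struct sk_stream *s, const struct sk_tok **padding)
{
  const struct sk_tok *t = NULL;

  *padding = NULL;
  while (s->pos < s->count)
    {
      t = &s->toks[s->pos++];
      if (t->kind != SK_PADDING)
        break;
      if (*padding == NULL)
        *padding = t;
      t = NULL;
    }
  if (t != NULL && t->kind == SK_OPEN_PAREN)
    return 1;
  if (t != NULL)
    s->pos--;   /* back up exactly one real token */
  return 0;
}
```

The important property is that at most one real token is ever un-read,
which keeps the back-up within what the text above says cpplib permits.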
621*e4b17023SJohn Marino
622*e4b17023SJohn Marino@section Marking tokens ineligible for future expansion
623*e4b17023SJohn MarinoAs discussed above, cpplib needs a way of marking tokens as
624*e4b17023SJohn Marinounexpandable.  Since the tokens cpplib handles are read-only once they
625*e4b17023SJohn Marinohave been lexed, it instead makes a copy of the token and adds the
626*e4b17023SJohn Marinoflag @code{NO_EXPAND} to the copy.
627*e4b17023SJohn Marino
628*e4b17023SJohn MarinoFor efficiency and to simplify memory management by avoiding having to
629*e4b17023SJohn Marinoremember to free these tokens, they are allocated as temporary tokens
630*e4b17023SJohn Marinofrom the lexer's current token run (@pxref{Lexing a line}) using the
631*e4b17023SJohn Marinofunction @code{_cpp_temp_token}.  The tokens are then re-used once the
632*e4b17023SJohn Marinocurrent line of tokens has been read in.
633*e4b17023SJohn Marino
This might sound unsafe.  However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization
is safe.
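
The copy-and-flag idea can be sketched with a fixed pool standing in
for a token run.  The pool, the flag value and the function names are
all assumptions of this example; the real allocation goes through
@code{_cpp_temp_token} as described above.

```c
#include <stddef.h>

#define SK_NO_EXPAND 0x4   /* hypothetical flag value */
#define SK_POOL_SIZE 16

struct sk_token
{
  const char *spelling;
  unsigned flags;
};

/* Toy stand-in for a token run: temporary tokens are handed out in
   order and the whole pool is recycled at once, never freed one by
   one.  */
static struct sk_token sk_pool[SK_POOL_SIZE];
static size_t sk_pool_used;

/* Return a temporary copy of SRC carrying the SK_NO_EXPAND flag, leaving
   the read-only original untouched.  Returns NULL if the pool is full.  */
static const struct sk_token *
sk_copy_no_expand (const struct sk_token *src)
{
  struct sk_token *copy;

  if (sk_pool_used == SK_POOL_SIZE)
    return NULL;
  copy = &sk_pool[sk_pool_used++];
  *copy = *src;
  copy->flags |= SK_NO_EXPAND;
  return copy;
}

/* Recycle every temporary token, as happens once a line is finished.  */
static void
sk_recycle_pool (void)
{
  sk_pool_used = 0;
}
```

After @code{sk_recycle_pool}, the next copy reuses the first slot, which
is why pointers to temporary tokens must not be kept across the point
where the run is re-used.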
639*e4b17023SJohn Marino
640*e4b17023SJohn Marino@node Token Spacing
641*e4b17023SJohn Marino@unnumbered Token Spacing
642*e4b17023SJohn Marino@cindex paste avoidance
643*e4b17023SJohn Marino@cindex spacing
644*e4b17023SJohn Marino@cindex token spacing
645*e4b17023SJohn Marino
646*e4b17023SJohn MarinoFirst, consider an issue that only concerns the stand-alone
647*e4b17023SJohn Marinopreprocessor: there needs to be a guarantee that re-reading its preprocessed
648*e4b17023SJohn Marinooutput results in an identical token stream.  Without taking special
649*e4b17023SJohn Marinomeasures, this might not be the case because of macro substitution.
650*e4b17023SJohn MarinoFor example:
651*e4b17023SJohn Marino
652*e4b17023SJohn Marino@smallexample
653*e4b17023SJohn Marino#define PLUS +
654*e4b17023SJohn Marino#define EMPTY
655*e4b17023SJohn Marino#define f(x) =x=
656*e4b17023SJohn Marino+PLUS -EMPTY- PLUS+ f(=)
657*e4b17023SJohn Marino        @expansion{} + + - - + + = = =
658*e4b17023SJohn Marino@emph{not}
659*e4b17023SJohn Marino        @expansion{} ++ -- ++ ===
660*e4b17023SJohn Marino@end smallexample
661*e4b17023SJohn Marino
662*e4b17023SJohn MarinoOne solution would be to simply insert a space between all adjacent
663*e4b17023SJohn Marinotokens.  However, we would like to keep space insertion to a minimum,
664*e4b17023SJohn Marinoboth for aesthetic reasons and because it causes problems for people who
665*e4b17023SJohn Marinostill try to abuse the preprocessor for things like Fortran source and
666*e4b17023SJohn MarinoMakefiles.
667*e4b17023SJohn Marino
For now, just notice that when tokens are added to (or removed from, as
shown by the @code{EMPTY} example) the original lexed token stream, we
need to check for accidental token pasting.  We call this @dfn{paste
avoidance}.  Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the @samp{#} and @samp{##} operators.
675*e4b17023SJohn Marino
676*e4b17023SJohn MarinoLook at how the preprocessor gets whitespace output correct
677*e4b17023SJohn Marinonormally.  The @code{cpp_token} structure contains a flags byte, and one
678*e4b17023SJohn Marinoof those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
679*e4b17023SJohn Marinoindicates that the token was preceded by whitespace of some form other
680*e4b17023SJohn Marinothan a new line.  The stand-alone preprocessor can use this flag to
681*e4b17023SJohn Marinodecide whether to insert a space between tokens in the output.
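
In the simple case, with no macros involved, output spacing reduces to
a loop like the one below.  The structure, the flag value and the
function @code{sk_render} are hypothetical and serve only to make the
idea concrete.

```c
#include <stddef.h>
#include <string.h>

#define SK_PREV_WHITE 0x1   /* hypothetical flag value */

struct sk_token
{
  const char *spelling;
  unsigned char flags;
};

/* Write the spellings of TOKS[0..N-1] to OUT, which must be large
   enough, emitting one space before each token that has SK_PREV_WHITE
   set.  */
static void
sk_render (const struct sk_token *toks, size_t n, char *out)
{
  size_t i;

  out[0] = '\0';
  for (i = 0; i < n; i++)
    {
      if (toks[i].flags & SK_PREV_WHITE)
        strcat (out, " ");
      strcat (out, toks[i].spelling);
    }
}
```

However much whitespace separated two tokens in the source, this loop
emits exactly one space, since the flag records only that some
whitespace was present.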
682*e4b17023SJohn Marino
683*e4b17023SJohn MarinoNow consider the result of the following macro expansion:
684*e4b17023SJohn Marino
685*e4b17023SJohn Marino@smallexample
686*e4b17023SJohn Marino#define add(x, y, z) x + y +z;
687*e4b17023SJohn Marinosum = add (1,2, 3);
688*e4b17023SJohn Marino        @expansion{} sum = 1 + 2 +3;
689*e4b17023SJohn Marino@end smallexample
690*e4b17023SJohn Marino
691*e4b17023SJohn MarinoThe interesting thing here is that the tokens @samp{1} and @samp{2} are
692*e4b17023SJohn Marinooutput with a preceding space, and @samp{3} is output without a
693*e4b17023SJohn Marinopreceding space, but when lexed none of these tokens had that property.
Careful consideration reveals that @samp{1} gets its preceding
whitespace from the space preceding @samp{add} in the macro invocation,
@emph{not} from the replacement list.  @samp{2} gets its whitespace from
the space preceding the parameter @samp{y} in the macro replacement
list, and @samp{3} has no preceding space because parameter @samp{z} has
none in the replacement list.
700*e4b17023SJohn Marino
701*e4b17023SJohn MarinoOnce lexed, tokens are effectively fixed and cannot be altered, since
702*e4b17023SJohn Marinopointers to them might be held in many places, in particular by
703*e4b17023SJohn Marinoin-progress macro expansions.  So instead of modifying the two tokens
704*e4b17023SJohn Marinoabove, the preprocessor inserts a special token, which I call a
705*e4b17023SJohn Marino@dfn{padding token}, into the token stream to indicate that spacing of
706*e4b17023SJohn Marinothe subsequent token is special.  The preprocessor inserts padding
707*e4b17023SJohn Marinotokens in front of every macro expansion and expanded macro argument.
708*e4b17023SJohn MarinoThese point to a @dfn{source token} from which the subsequent real token
709*e4b17023SJohn Marinoshould inherit its spacing.  In the above example, the source tokens are
710*e4b17023SJohn Marino@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
711*e4b17023SJohn Marinomacro replacement list, respectively.
712*e4b17023SJohn Marino
713*e4b17023SJohn MarinoIt is quite easy to get multiple padding tokens in a row, for example if
714*e4b17023SJohn Marinoa macro's first replacement token expands straight into another macro.
715*e4b17023SJohn Marino
716*e4b17023SJohn Marino@smallexample
717*e4b17023SJohn Marino#define foo bar
718*e4b17023SJohn Marino#define bar baz
719*e4b17023SJohn Marino[foo]
720*e4b17023SJohn Marino        @expansion{} [baz]
721*e4b17023SJohn Marino@end smallexample
722*e4b17023SJohn Marino
Here, two padding tokens are generated.  Their sources are the
@samp{foo} token between the brackets and the @samp{bar} token from
foo's replacement list, respectively.  Clearly the first padding token
is the one to use, so the output code should contain a rule that the
first padding token in a sequence is the one that matters.
728*e4b17023SJohn Marino
But what happens when we leave a macro expansion?  Adjusting the above
example slightly:
731*e4b17023SJohn Marino
732*e4b17023SJohn Marino@smallexample
733*e4b17023SJohn Marino#define foo bar
734*e4b17023SJohn Marino#define bar EMPTY baz
735*e4b17023SJohn Marino#define EMPTY
736*e4b17023SJohn Marino[foo] EMPTY;
737*e4b17023SJohn Marino        @expansion{} [ baz] ;
738*e4b17023SJohn Marino@end smallexample
739*e4b17023SJohn Marino
740*e4b17023SJohn MarinoAs shown, now there should be a space before @samp{baz} and the
741*e4b17023SJohn Marinosemicolon in the output.
742*e4b17023SJohn Marino
The rules we decided on above fail for @samp{baz}: we generate three
padding tokens, one per macro invocation, before the token @samp{baz}.
We would then have it take its spacing from the first of these, which
carries source token @samp{foo} with no leading space.
747*e4b17023SJohn Marino
748*e4b17023SJohn MarinoIt is vital that cpplib get spacing correct in these examples since any
749*e4b17023SJohn Marinoof these macro expansions could be stringified, where spacing matters.
750*e4b17023SJohn Marino
So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too.  I made
cpplib insert a padding token with a @code{NULL} source token when
leaving macro expansions, as well as after each replaced argument in a
macro's replacement list.  It also inserts appropriate padding tokens on
either side of tokens created by the @samp{#} and @samp{##} operators.
I expanded the rule so that, if we see a padding token with a
@code{NULL} source token, @emph{and} the source token currently in
effect has no leading space, then we behave as if we have seen no
padding tokens at all.  A quick check shows this rule will then get the
above example correct as well.
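
One reading of the combined rule is sketched below.  It is a
simplification that ignores paste avoidance entirely, and every name in
it is hypothetical; it is meant only to let the examples in this
section be traced by hand.

```c
#include <stddef.h>
#include <string.h>

#define SK_PREV_WHITE 0x1   /* hypothetical flag value */

enum sk_type { SK_REAL, SK_PADDING };

struct sk_token
{
  enum sk_type type;
  unsigned flags;
  const char *spelling;            /* real tokens only */
  const struct sk_token *source;   /* padding tokens only; may be NULL */
};

/* Render TOKS[0..N-1] into OUT (assumed large enough).  Within a run of
   padding tokens the first source wins, except that a NULL-source
   padding token cancels a source that has no leading space.  A real
   token with no source in effect uses its own SK_PREV_WHITE flag.  */
static void
sk_render_padded (const struct sk_token *const *toks, size_t n, char *out)
{
  const struct sk_token *source = NULL;
  size_t i;

  out[0] = '\0';
  for (i = 0; i < n; i++)
    {
      const struct sk_token *t = toks[i];

      if (t->type == SK_PADDING)
        {
          if (source == NULL
              || (!(source->flags & SK_PREV_WHITE) && t->source == NULL))
            source = t->source;
          continue;
        }
      if (source == NULL)
        source = t;
      if (source->flags & SK_PREV_WHITE)
        strcat (out, " ");
      strcat (out, t->spelling);
      source = NULL;
    }
}
```

Feeding this function the token stream for @samp{[foo] EMPTY;}, with a
padding token on entry to each macro and a @code{NULL}-source padding
token on each exit, yields @samp{[ baz] ;} under these assumptions.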
762*e4b17023SJohn Marino
763*e4b17023SJohn MarinoNow a relationship with paste avoidance is apparent: we have to be
764*e4b17023SJohn Marinocareful about paste avoidance in exactly the same locations we have
765*e4b17023SJohn Marinopadding tokens in order to get white space correct.  This makes
766*e4b17023SJohn Marinoimplementation of paste avoidance easy: wherever the stand-alone
767*e4b17023SJohn Marinopreprocessor is fixing up spacing because of padding tokens, and it
768*e4b17023SJohn Marinoturns out that no space is needed, it has to take the extra step to
769*e4b17023SJohn Marinocheck that a space is not needed after all to avoid an accidental paste.
770*e4b17023SJohn MarinoThe function @code{cpp_avoid_paste} advises whether a space is required
771*e4b17023SJohn Marinobetween two consecutive tokens.  To avoid excessive spacing, it tries
772*e4b17023SJohn Marinohard to only require a space if one is likely to be necessary, but for
773*e4b17023SJohn Marinoreasons of efficiency it is slightly conservative and might recommend a
774*e4b17023SJohn Marinospace where one is not strictly needed.
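
To give a flavour of such a check, here is a deliberately small and
conservative sketch that works on token spellings.  It is not
@code{cpp_avoid_paste} and covers only a handful of cases; the function
names are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Return nonzero if C can be part of an identifier or number.  */
static int
sk_is_word_char (char c)
{
  return isalnum ((unsigned char) c) || c == '_';
}

/* Conservatively decide whether a space is needed between two adjacent
   token spellings A and B (both non-empty) so that re-lexing the output
   cannot glue them together.  May answer yes when no space is strictly
   necessary, and does not attempt to cover every punctuator.  */
static int
sk_avoid_paste (const char *a, const char *b)
{
  char last = a[strlen (a) - 1];
  char first = b[0];

  /* Two identifiers or numbers would merge into one token.  */
  if (sk_is_word_char (last) && sk_is_word_char (first))
    return 1;

  /* Punctuators that could combine with FIRST into a longer token, or
     into the start of a comment.  */
  switch (last)
    {
    case '+': return first == '+' || first == '=';
    case '-': return first == '-' || first == '=' || first == '>';
    case '=': return first == '=';
    case '<': return first == '<' || first == '=';
    case '>': return first == '>' || first == '=';
    case '&': return first == '&' || first == '=';
    case '|': return first == '|' || first == '=';
    case '/': return first == '/' || first == '*' || first == '=';
    case '#': return first == '#';
    default:  return 0;
    }
}
```

Applied to the @code{PLUS} example at the start of this section, the
check asks for a space between the two @samp{+} tokens but not between
an identifier and a following @samp{+}.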
775*e4b17023SJohn Marino
776*e4b17023SJohn Marino@node Line Numbering
777*e4b17023SJohn Marino@unnumbered Line numbering
778*e4b17023SJohn Marino@cindex line numbers
779*e4b17023SJohn Marino
780*e4b17023SJohn Marino@section Just which line number anyway?
781*e4b17023SJohn Marino
782*e4b17023SJohn MarinoThere are three reasonable requirements a cpplib client might have for
783*e4b17023SJohn Marinothe line number of a token passed to it:
784*e4b17023SJohn Marino
785*e4b17023SJohn Marino@itemize @bullet
786*e4b17023SJohn Marino@item
787*e4b17023SJohn MarinoThe source line it was lexed on.
788*e4b17023SJohn Marino@item
789*e4b17023SJohn MarinoThe line it is output on.  This can be different to the line it was
790*e4b17023SJohn Marinolexed on if, for example, there are intervening escaped newlines or
791*e4b17023SJohn MarinoC-style comments.  For example:
792*e4b17023SJohn Marino
793*e4b17023SJohn Marino@smallexample
794*e4b17023SJohn Marinofoo /* @r{A long
795*e4b17023SJohn Marinocomment} */ bar \
796*e4b17023SJohn Marinobaz
797*e4b17023SJohn Marino@result{}
798*e4b17023SJohn Marinofoo bar baz
799*e4b17023SJohn Marino@end smallexample
800*e4b17023SJohn Marino
801*e4b17023SJohn Marino@item
802*e4b17023SJohn MarinoIf the token results from a macro expansion, the line of the macro name,
803*e4b17023SJohn Marinoor possibly the line of the closing parenthesis in the case of
804*e4b17023SJohn Marinofunction-like macro expansion.
805*e4b17023SJohn Marino@end itemize
806*e4b17023SJohn Marino
807*e4b17023SJohn MarinoThe @code{cpp_token} structure contains @code{line} and @code{col}
808*e4b17023SJohn Marinomembers.  The lexer fills these in with the line and column of the first
809*e4b17023SJohn Marinocharacter of the token.  Consequently, but maybe unexpectedly, a token
810*e4b17023SJohn Marinofrom the replacement list of a macro expansion carries the location of
811*e4b17023SJohn Marinothe token within the @code{#define} directive, because cpplib expands a
macro by returning pointers to the tokens in its replacement list.  The
current implementation of cpplib assigns tokens created from built-in
macros and the @samp{#} and @samp{##} operators the location of the most
recently lexed token.  This is because they are allocated from the
lexer's token runs, and because of the way the diagnostic routines infer
the appropriate location to report.
818*e4b17023SJohn Marino
819*e4b17023SJohn MarinoThe diagnostic routines in cpplib display the location of the most
820*e4b17023SJohn Marinorecently @emph{lexed} token, unless they are passed a specific line and
821*e4b17023SJohn Marinocolumn to report.  For diagnostics regarding tokens that arise from
822*e4b17023SJohn Marinomacro expansions, it might also be helpful for the user to see the
823*e4b17023SJohn Marinooriginal location in the macro definition that the token came from.
824*e4b17023SJohn MarinoSince that is exactly the information each token carries, such an
825*e4b17023SJohn Marinoenhancement could be made relatively easily in future.
826*e4b17023SJohn Marino
827*e4b17023SJohn MarinoThe stand-alone preprocessor faces a similar problem when determining
828*e4b17023SJohn Marinothe correct line to output the token on: the position attached to a
829*e4b17023SJohn Marinotoken is fairly useless if the token came from a macro expansion.  All
830*e4b17023SJohn Marinotokens on a logical line should be output on its first physical line, so
831*e4b17023SJohn Marinothe token's reported location is also wrong if it is part of a physical
832*e4b17023SJohn Marinoline other than the first.
833*e4b17023SJohn Marino
To solve these issues, cpplib provides a callback that is invoked
whenever it lexes a preprocessing token that starts a new logical line
other than a directive.  It passes this token (which may be a
@code{CPP_EOF} token indicating the end of the translation unit) to the
callback routine, which can then use the line and column of this token
to produce correct output.
840*e4b17023SJohn Marino
841*e4b17023SJohn Marino@section Representation of line numbers
842*e4b17023SJohn Marino
843*e4b17023SJohn MarinoAs mentioned above, cpplib stores with each token the line number that
844*e4b17023SJohn Marinoit was lexed on.  In fact, this number is not the number of the line in
845*e4b17023SJohn Marinothe source file, but instead bears more resemblance to the number of the
846*e4b17023SJohn Marinoline in the translation unit.
847*e4b17023SJohn Marino
The preprocessor maintains a monotonically increasing line count, which is
849*e4b17023SJohn Marinoincremented at every new line character (and also at the end of any
850*e4b17023SJohn Marinobuffer that does not end in a new line).  Since a line number of zero is
851*e4b17023SJohn Marinouseful to indicate certain special states and conditions, this variable
852*e4b17023SJohn Marinostarts counting from one.
853*e4b17023SJohn Marino
This variable therefore uniquely enumerates each line in the translation
unit.  With some simple infrastructure, it is straightforward to map
from this to the original source file and line number pair, saving space
whenever line number information needs to be saved.  The code that
implements this mapping lies in the files @file{line-map.c} and
@file{line-map.h}.
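
A minimal sketch of such a mapping is given below, assuming a sorted
table with one entry per change of file.  The structure and function
names are hypothetical and much simpler than what @file{line-map.h}
actually declares.

```c
#include <stddef.h>

/* One entry per change of file or line numbering.  Entries are sorted
   by START, the first translation-unit line number the entry covers.  */
struct sk_line_map
{
  unsigned start;       /* first logical (translation-unit) line */
  const char *file;     /* source file for this range */
  unsigned file_line;   /* source line corresponding to START */
};

/* Find the entry covering LINE by binary search.  MAPS must be
   non-empty and MAPS[0].start must not exceed LINE.  */
static const struct sk_line_map *
sk_lookup_line (const struct sk_line_map *maps, size_t count, unsigned line)
{
  size_t lo = 0, hi = count;

  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (maps[mid].start <= line)
        lo = mid;
      else
        hi = mid;
    }
  return &maps[lo];
}

/* Translate LINE to a line number within the file of its map entry.  */
static unsigned
sk_source_line (const struct sk_line_map *map, unsigned line)
{
  return map->file_line + (line - map->start);
}
```

Each token then needs to store only a single integer; the file name and
source line are recovered on demand, for instance when a diagnostic is
issued.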
860*e4b17023SJohn Marino
861*e4b17023SJohn MarinoCommand-line macros and assertions are implemented by pushing a buffer
862*e4b17023SJohn Marinocontaining the right hand side of an equivalent @code{#define} or
863*e4b17023SJohn Marino@code{#assert} directive.  Some built-in macros are handled similarly.
864*e4b17023SJohn MarinoSince these are all processed before the first line of the main input
865*e4b17023SJohn Marinofile, it will typically have an assigned line closer to twenty than to
866*e4b17023SJohn Marinoone.
867*e4b17023SJohn Marino
868*e4b17023SJohn Marino@node Guard Macros
869*e4b17023SJohn Marino@unnumbered The Multiple-Include Optimization
870*e4b17023SJohn Marino@cindex guard macros
871*e4b17023SJohn Marino@cindex controlling macros
872*e4b17023SJohn Marino@cindex multiple-include optimization
873*e4b17023SJohn Marino
874*e4b17023SJohn MarinoHeader files are often of the form
875*e4b17023SJohn Marino
876*e4b17023SJohn Marino@smallexample
877*e4b17023SJohn Marino#ifndef FOO
878*e4b17023SJohn Marino#define FOO
879*e4b17023SJohn Marino@dots{}
880*e4b17023SJohn Marino#endif
881*e4b17023SJohn Marino@end smallexample
882*e4b17023SJohn Marino
883*e4b17023SJohn Marino@noindent
to prevent the compiler from processing them more than once.  The
preprocessor notices such header files, so that if the header file
appears in a subsequent @code{#include} directive and @code{FOO} is
defined, then the directive is ignored: the preprocessor doesn't
preprocess or even re-open the file a second time.  This is referred to
as the @dfn{multiple include optimization}.
890*e4b17023SJohn Marino
891*e4b17023SJohn MarinoUnder what circumstances is such an optimization valid?  If the file
892*e4b17023SJohn Marinowere included a second time, it can only be optimized away if that
893*e4b17023SJohn Marinoinclusion would result in no tokens to return, and no relevant
894*e4b17023SJohn Marinodirectives to process.  Therefore the current implementation imposes
895*e4b17023SJohn Marinorequirements and makes some allowances as follows:
896*e4b17023SJohn Marino
@enumerate
@item
There must be no tokens outside the controlling @code{#if}-@code{#endif}
pair, but whitespace and comments are permitted.

@item
There must be no directives outside the controlling directive pair, but
the @dfn{null directive} (a line containing nothing other than a single
@samp{#} and possibly whitespace) is permitted.

@item
The opening directive must be of the form

@smallexample
#ifndef FOO
@end smallexample

or

@smallexample
#if !defined FOO     [equivalently, #if !defined(FOO)]
@end smallexample

@item
In the second form above, the tokens forming the @code{#if} expression
must have come directly from the source file---no macro expansion may
have been involved.  This is because macro definitions can change, and
tracking whether or not a relevant change has been made is not worth the
implementation cost.

@item
There can be no @code{#else} or @code{#elif} directives at the outer
conditional block level, because they would probably contain something
of interest to a subsequent pass.
@end enumerate

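As an illustration of the rules above, consider two hypothetical
headers; the file names, macro names and declarations are invented for
this sketch.  Under the stated requirements the first should qualify
for the optimization, because only comments, whitespace and a null
directive lie outside the controlling pair.

@smallexample
/* good.h */
#
#if !defined(GOOD_H)
#define GOOD_H
int good (void);
#endif
/* Trailing comments are permitted.  */
@end smallexample

@noindent
The second should not qualify, for two independent reasons marked in
the comments.

@smallexample
/* bad.h */
#define BAD_VERSION 2   /* a directive outside the pair */
#ifndef BAD_H
#define BAD_H
int bad (void);
#else                   /* #else at the outer level */
#endif
@end smallexample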
First, when pushing a new file on the buffer stack,
@code{stack_include_file} sets the controlling macro @code{mi_cmacro} to
@code{NULL}, and sets @code{mi_valid} to @code{true}.  This indicates
that the preprocessor has not yet encountered anything that would
invalidate the multiple-include optimization.  As described in the next
few paragraphs, these two variables having these values effectively
indicates top-of-file.

When about to return a token that is not part of a directive,
@code{_cpp_lex_token} sets @code{mi_valid} to @code{false}.  This
enforces the constraint that tokens outside the controlling conditional
block invalidate the optimization.

The @code{do_if}, when appropriate, and @code{do_ifndef} directive
handlers pass the controlling macro to the function
@code{push_conditional}.  cpplib maintains a stack of nested conditional
blocks, and after processing every opening conditional this function
pushes an @code{if_stack} structure onto the stack.  In this structure
it records the controlling macro for the block, provided there is one
and we're at top-of-file (as described above).  If an @code{#elif} or
@code{#else} directive is encountered, the controlling macro for that
block is cleared to @code{NULL}.  Otherwise, it survives until the
@code{#endif} closing the block, upon which @code{do_endif} sets
@code{mi_valid} to true and stores the controlling macro in
@code{mi_cmacro}.

@code{_cpp_handle_directive} clears @code{mi_valid} when processing any
directive other than an opening conditional or the null directive.
Three conditions together therefore enforce the absence of directives
outside the main conditional block: a controlling macro is recorded
only at top-of-file, it survives only if the block has no @code{#else}
or @code{#elif}, and only then does @code{do_endif} copy it to
@code{mi_cmacro}.

Note that whilst we are inside the conditional block, @code{mi_valid} is
likely to be reset to @code{false}, but this does not matter since
the closing @code{#endif} restores it to @code{true} if appropriate.

Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack
at @code{EOF} without returning a token, if the @code{#endif} directive
was not followed by any tokens, @code{mi_valid} is @code{true} and
@code{_cpp_pop_file_buffer} remembers the controlling macro associated
with the file.  Subsequent calls to @code{stack_include_file} result in
no buffer being pushed if the controlling macro is defined, effecting
the optimization.

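The bookkeeping described above can be sketched as a small state
machine.  This is a simplified model and not cpplib's code: the event
names, @code{struct step}, @code{guard_macro} and the fixed-depth stack
are assumptions made for illustration, and only the names
@code{mi_valid} and @code{mi_cmacro} come from the text.  The sketch
also assumes that only the outermost block can restore @code{mi_valid}.
Feeding it the events for an opening conditional on @code{FOO}, another
directive, a token and an @code{#endif} should yield @code{FOO}; adding
a token after the @code{#endif} should yield @code{NULL}.

@smallexample
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events standing in for what the lexer and the
   directive handlers observe; these are not cpplib names.  */
enum event
@{
  EV_TOKEN,            /* token that is not part of a directive */
  EV_NULL_DIRECTIVE,   /* a lone '#' */
  EV_OTHER_DIRECTIVE,  /* #define, #include, ... */
  EV_OPEN_COND,        /* #ifndef FOO or #if !defined FOO */
  EV_ELSE,             /* #else or #elif */
  EV_ENDIF
@};

struct step
@{
  enum event ev;
  const char *macro;   /* guard candidate for EV_OPEN_COND */
@};

#define MAX_DEPTH 16

/* Return the guard macro remembered at end of file, or NULL if
   the file does not qualify.  */
const char *
guard_macro (const struct step *steps, size_t n)
@{
  const char *if_stack[MAX_DEPTH];
  size_t depth = 0;
  const char *mi_cmacro = NULL;   /* top-of-file state */
  bool mi_valid = true;

  for (size_t i = 0; i < n; i++)
    switch (steps[i].ev)
      @{
      case EV_NULL_DIRECTIVE:
        break;
      case EV_TOKEN:
      case EV_OTHER_DIRECTIVE:
        mi_valid = false;
        break;
      case EV_OPEN_COND:
        if (depth == MAX_DEPTH)
          return NULL;
        /* Record the macro only at top-of-file.  */
        if_stack[depth++]
          = (mi_valid && mi_cmacro == NULL) ? steps[i].macro : NULL;
        break;
      case EV_ELSE:
        mi_valid = false;
        if (depth > 0)
          if_stack[depth - 1] = NULL;
        break;
      case EV_ENDIF:
        mi_valid = false;
        if (depth == 0)
          return NULL;            /* unbalanced input */
        depth--;
        if (depth == 0 && if_stack[0] != NULL)
          @{
            mi_valid = true;
            mi_cmacro = if_stack[0];
          @}
        break;
      @}

  return (mi_valid && depth == 0) ? mi_cmacro : NULL;
@}
@end smallexample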
A quick word on how we handle the

@smallexample
#if !defined FOO
@end smallexample

@noindent
case.  @code{_cpp_parse_expr} and @code{parse_defined} take steps to see
whether the three stages @samp{!}, @samp{defined-expression} and
@samp{end-of-directive} occur in order in a @code{#if} expression.  If
so, they return the guard macro to @code{do_if} in the variable
@code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.
@code{enter_macro_context} sets @code{mi_valid} to false, so if a macro
was expanded whilst parsing any part of the expression, then the
top-of-file test in @code{push_conditional} fails and the optimization
is turned off.

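The three-stage check can be pictured as matching an already-lexed
token list.  The sketch below is not how @code{_cpp_parse_expr} or
@code{parse_defined} are written; @code{guard_from_if}, @code{is_name}
and the string-array token representation are hypothetical, and the
@code{macro_expanded} flag merely stands in for the effect of
@code{enter_macro_context} clearing @code{mi_valid}.

@smallexample
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if S looks like a C identifier.  */
static bool
is_name (const char *s)
@{
  if (!isalpha ((unsigned char) *s) && *s != '_')
    return false;
  for (s++; *s != '\0'; s++)
    if (!isalnum ((unsigned char) *s) && *s != '_')
      return false;
  return true;
@}

/* TOKS holds the N tokens of a #if expression, up to but
   excluding end-of-directive.  Return the guard macro if they are
   exactly "! defined NAME" or "! defined ( NAME )" and no macro
   was expanded while reading them; otherwise return NULL.  */
const char *
guard_from_if (const char *const *toks, size_t n,
               bool macro_expanded)
@{
  const char *name;

  if (macro_expanded || n < 3
      || strcmp (toks[0], "!") != 0
      || strcmp (toks[1], "defined") != 0)
    return NULL;

  if (n == 3)
    name = toks[2];
  else if (n == 5 && strcmp (toks[2], "(") == 0
           && strcmp (toks[4], ")") == 0)
    name = toks[3];
  else
    return NULL;

  return is_name (name) ? name : NULL;
@}
@end smallexample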
@node Files
@unnumbered File Handling
@cindex files

Fairly obviously, the file handling code of cpplib resides in the file
@file{files.c}.  It takes care of the details of file searching,
opening, reading and caching, for both the main source file and all the
headers it recursively includes.

The basic strategy is to minimize the number of system calls.  On many
systems, the basic @code{open ()} and @code{fstat ()} system calls can
be quite expensive.  For every @code{#include}-d file, we need to try
all the directories in the search path until we find a match.  Some
projects, such as glibc, pass twenty or thirty include paths on the
command line, so this can rapidly become time consuming.

For a header file we have not encountered before we have little choice
but to do this.  However, it is often the case that the same headers are
repeatedly included, and in these cases we try to avoid repeating the
filesystem queries whilst searching for the correct file.

For each file we try to open, we store the constructed path in a splay
tree.  This path first undergoes simplification by the function
@code{_cpp_simplify_pathname}.  For example,
@file{/usr/include/bits/../foo.h} is simplified to
@file{/usr/include/foo.h} before we enter it in the splay tree and try
to @code{open ()} the file.  CPP will then find subsequent uses of
@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
save system calls.

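To make the simplification step concrete, here is a minimal sketch of
purely lexical path simplification for absolute paths.  It is not the
code of @code{_cpp_simplify_pathname}; the function name
@code{simplify_path} is hypothetical, and the sketch deliberately
ignores complications such as symbolic links, relative paths and
host-specific path syntax.

@smallexample
#include <string.h>

/* Collapse "." and ".." components and repeated '/' in the
   absolute path PATH, in place.  Paths that do not start with
   '/' are left untouched.  */
void
simplify_path (char *path)
@{
  char *out = path;          /* end of the simplified prefix */
  const char *in = path;     /* next unread input character */

  if (path[0] != '/')
    return;

  while (*in != '\0')
    @{
      while (*in == '/')     /* skip separators */
        in++;
      size_t len = strcspn (in, "/");
      if (len == 0)
        break;
      if (len == 1 && in[0] == '.')
        ;                    /* ".": drop the component */
      else if (len == 2 && in[0] == '.' && in[1] == '.')
        @{
          /* "..": remove the previous component, if any.  */
          while (out > path && *--out != '/')
            ;
        @}
      else
        @{
          *out++ = '/';
          memmove (out, in, len);
          out += len;
        @}
      in += len;
    @}

  if (out == path)
    *out++ = '/';            /* everything cancelled: root */
  *out = '\0';
@}
@end smallexample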
Further, it is likely the file contents have also been cached, saving a
@code{read ()} system call.  We don't bother caching the contents of
header files that are re-inclusion protected, and whose re-inclusion
macro is defined when we leave the header file for the first time.  If
the host supports it, we try to map suitably large files into memory,
rather than reading them in directly.

The include paths are internally stored on a null-terminated
singly-linked list, starting with the @code{"header.h"} directory search
chain, which then links into the @code{<header.h>} directory chain.

Files included with the @code{<foo.h>} syntax start the lookup directly
in the second half of this chain.  However, files included with the
@code{"foo.h"} syntax start at the beginning of the chain, but with one
extra directory prepended.  This is the directory of the current file;
the one containing the @code{#include} directive.  Prepending this
directory on a per-file basis is handled by the function
@code{search_from}.

Note that a header included with a directory component, such as
@code{#include "mydir/foo.h"} and opened as
@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
the basename @samp{foo.h} as the current directory.

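The lookup order described above can be sketched with a plain linked
list.  Everything named here (@code{struct dir}, @code{dir_of},
@code{append}, @code{search_order}) is hypothetical and is not taken
from @file{files.c}; in particular this is not the implementation of
@code{search_from}.  The sketch only lists, in order, the directories
that would be tried, assuming @samp{/} is the path separator.

@smallexample
#include <stdio.h>
#include <string.h>

struct dir
@{
  const char *name;
  const struct dir *next;   /* NULL terminates the chain */
@};

/* Store in BUF the directory of PATH: the complete path minus
   the basename.  */
static void
dir_of (const char *path, char *buf, size_t size)
@{
  const char *slash = strrchr (path, '/');

  if (slash == NULL)
    snprintf (buf, size, ".");
  else if (slash == path)
    snprintf (buf, size, "/");
  else
    snprintf (buf, size, "%.*s", (int) (slash - path), path);
@}

/* Append S to the ':'-separated list in OUT, truncating if
   needed.  */
static void
append (char *out, size_t size, const char *s)
@{
  size_t used = strlen (out);

  if (used > 0 && used + 1 < size)
    @{
      out[used++] = ':';
      out[used] = '\0';
    @}
  strncat (out, s, size - used - 1);
@}

/* Write to OUT, in order, the directories tried for an #include
   found in CURRENT_FILE.  BRACKET_HEAD is the node inside the
   chain starting at QUOTE_HEAD where the <header.h> part begins.
   SIZE must be at least 1.  */
void
search_order (const struct dir *quote_head,
              const struct dir *bracket_head,
              const char *current_file, int angled,
              char *out, size_t size)
@{
  const struct dir *d = angled ? bracket_head : quote_head;
  char cur[256];

  out[0] = '\0';
  if (!angled)
    @{
      /* "foo.h" form: the includer's directory comes first.  */
      dir_of (current_file, cur, sizeof cur);
      append (out, size, cur);
    @}
  for (; d != NULL; d = d->next)
    append (out, size, d->name);
@}
@end smallexample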
Enough information is stored in the splay tree that CPP can immediately
tell whether it can skip the header file because of the multiple-include
optimization, whether the file didn't exist or couldn't be opened for
some reason, or whether the header was flagged not to be re-used, as it
is with the obsolete @code{#import} directive.

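One way to picture the per-file information is a small record consulted
before any system call is made.  The field names, the @code{enum
action} values and the function @code{decide} below are assumptions
made for this sketch; they do not reflect the actual layout of the
splay tree nodes in cpplib.

@smallexample
#include <stdbool.h>
#include <stddef.h>

enum action @{ ACT_SKIP, ACT_FAIL, ACT_STACK @};

struct cache_entry
@{
  bool open_failed;    /* file was missing or could not be opened */
  bool once_only;      /* flagged not to be re-used, as by #import */
  bool seen;           /* already included at least once */
  const char *guard;   /* recorded guard macro, or NULL */
@};

/* Decide what to do with a cached file.  GUARD_DEFINED says
   whether E->guard is currently defined as a macro.  */
enum action
decide (const struct cache_entry *e, bool guard_defined)
@{
  if (e->open_failed)
    return ACT_FAIL;              /* no need to retry open () */
  if (e->once_only && e->seen)
    return ACT_SKIP;
  if (e->guard != NULL && guard_defined)
    return ACT_SKIP;              /* multiple-include optimization */
  return ACT_STACK;               /* process the file (again) */
@}
@end smallexample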
For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
CPP offers the ability to treat various include file names as aliases
for the real header files with shorter names.  The map from one to the
other is found in a special file called @samp{header.gcc}, stored in the
command line (or system) include directories to which the mapping
applies.  This may be higher up the directory tree than the full path to
the file minus the base name.

@node Concept Index
@unnumbered Concept Index
@printindex cp

@bye