============================
"Clang" CFE Internals Manual
============================

.. contents::
   :local:

Introduction
============

This document describes some of the more important APIs and internal design
decisions made in the Clang C front-end. The purpose of this document is to
both capture some of this high-level information and also describe some of the
design decisions behind it. This is meant for people interested in hacking on
Clang, not for end-users. The description below is categorized by libraries,
and does not describe any of the clients of the libraries.

LLVM Support Library
====================

The LLVM ``libSupport`` library provides many underlying libraries and
`data-structures <https://llvm.org/docs/ProgrammersManual.html>`_, including
command line option processing, various containers and a system abstraction
layer, which is used for file system access.

The Clang "Basic" Library
=========================

This library certainly needs a better name. The "basic" library contains a
number of low-level utilities for tracking and manipulating source buffers,
locations within the source buffers, diagnostics, tokens, target abstraction,
and information about the subset of the language being compiled for.

Part of this infrastructure is specific to C (such as the ``TargetInfo``
class); other parts could be reused for other non-C-based languages
(``SourceLocation``, ``SourceManager``, ``Diagnostics``, ``FileManager``).
When and if there is future demand we can figure out if it makes sense to
introduce a new library, move the general classes somewhere else, or introduce
some other solution.

We describe the roles of these classes in order of their dependencies.

The Diagnostics Subsystem
-------------------------

The Clang Diagnostics subsystem is an important part of how the compiler
communicates with the human.
Diagnostics are the warnings and errors produced
when the code is incorrect or dubious. In Clang, each diagnostic produced has
(at the minimum) a unique ID, an English translation associated with it, a
:ref:`SourceLocation <SourceLocation>` to "put the caret", and a severity
(e.g., ``WARNING`` or ``ERROR``). They can also optionally include a number of
arguments to the diagnostic (which fill in "%0"'s in the string) as well as a
number of source ranges that relate to the diagnostic.

In this section, we'll be giving examples produced by the Clang command line
driver, but diagnostics can be :ref:`rendered in many different ways
<DiagnosticConsumer>` depending on how the ``DiagnosticConsumer`` interface is
implemented. A representative example of a diagnostic is:

.. code-block:: text

  t.c:38:15: error: invalid operands to binary expression ('int *' and '_Complex float')
  P = (P-42) + Gamma*4;
      ~~~~~~ ^ ~~~~~~~

In this example, you can see the English translation, the severity (error),
the source location (the caret ("``^``") and file/line/column info),
the source ranges "``~~~~``", and the arguments to the diagnostic ("``int *``"
and "``_Complex float``"). You'll have to believe me that there is a unique ID
backing the diagnostic :).

Getting all of this to happen has several steps and involves many moving
pieces; this section describes them and talks about best practices when adding
a new diagnostic.

The ``Diagnostic*Kinds.td`` files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Diagnostics are created by adding an entry to one of the
``clang/Basic/Diagnostic*Kinds.td`` files, depending on what library will be
using it. From this file, :program:`tblgen` generates the unique ID of the
diagnostic, the severity of the diagnostic and the English translation + format
string.

There is little sanity with the naming of the unique ID's right now.
Some
start with ``err_``, ``warn_``, ``ext_`` to encode the severity into the name.
Since the enum is referenced in the C++ code that produces the diagnostic, it
is somewhat useful for it to be reasonably short.

The severity of the diagnostic comes from the set {``NOTE``, ``REMARK``,
``WARNING``, ``EXTENSION``, ``EXTWARN``, ``ERROR``}. The ``ERROR`` severity is
used for diagnostics indicating the program is never acceptable under any
circumstances. When an error is emitted, the AST for the input code may not be
fully built. The ``EXTENSION`` and ``EXTWARN`` severities are used for
extensions to the language that Clang accepts. This means that Clang fully
understands and can represent them in the AST, but we produce diagnostics to
tell the user their code is non-portable. The difference is that the former are
ignored by default, and the latter warn by default. The ``WARNING`` severity is
used for constructs that are valid in the currently selected source language
but that are dubious in some way. The ``REMARK`` severity provides generic
information about the compilation that is not necessarily related to any
dubious code. The ``NOTE`` level is used to staple more information onto
previous diagnostics.

These *severities* are mapped into a smaller set (the ``Diagnostic::Level``
enum, {``Ignored``, ``Note``, ``Remark``, ``Warning``, ``Error``, ``Fatal``}) of
output *levels* by the diagnostics subsystem based on various configuration
options. Clang internally supports a fully fine-grained mapping mechanism that
allows you to map almost any diagnostic to the output level that you want. The
only diagnostics that cannot be mapped are ``NOTE``\ s, which always follow the
severity of the previously emitted diagnostic, and ``ERROR``\ s, which can only
be mapped to ``Fatal`` (it is not possible to turn an error into a warning, for
example).

Diagnostic mappings are used in many ways.
For example, if the user specifies
``-pedantic``, ``EXTENSION`` maps to ``Warning``; if they specify
``-pedantic-errors``, it turns into ``Error``. This is used to implement
options like ``-Wunused-macros``, ``-Wundef`` etc.

Mapping to ``Fatal`` should only be used for diagnostics that are considered so
severe that error recovery won't be able to recover sensibly from them (thus
spewing a ton of bogus errors). One example of this class of error is failure
to ``#include`` a file.

The Format String
^^^^^^^^^^^^^^^^^

The format string for the diagnostic is very simple, but it has some power. It
takes the form of a string in English with markers that indicate where and how
arguments to the diagnostic are inserted and formatted. For example, here are
some simple format strings:

.. code-block:: c++

  "binary integer literals are an extension"
  "format string contains '\\0' within the string body"
  "more '%%' conversions than data arguments"
  "invalid operands to binary expression (%0 and %1)"
  "overloaded '%0' must be a %select{unary|binary|unary or binary}2 operator"
       " (has %1 parameter%s1)"

These examples show some important points of format strings. You can use any
plain ASCII character in the diagnostic string except "``%``" without a
problem, but these are C strings, so you have to use and be aware of all the C
escape sequences (as in the second example). If you want to produce a "``%``"
in the output, use the "``%%``" escape sequence, like the third diagnostic.
Finally, Clang uses the "``%...[digit]``" sequences to specify where and how
arguments to the diagnostic are formatted.

Arguments to the diagnostic are numbered according to how they are specified by
the C++ code that :ref:`produces them <internals-producing-diag>`, and are
referenced by ``%0`` .. ``%9``.
If you have more than 10 arguments to your
diagnostic, you are doing something wrong :). Unlike ``printf``, there is no
requirement that arguments to the diagnostic end up in the output in the same
order as they are specified; you could have a format string with "``%1 %0``"
that swaps them, for example. The text between the percent sign and the digit
gives the formatting instructions. If there are no instructions, the argument
is just turned into a string and substituted in.

Here are some "best practices" for writing the English format string:

* Keep the string short. It should ideally fit in the 80 column limit of the
  ``DiagnosticKinds.td`` file. This avoids the diagnostic wrapping when
  printed, and forces you to think about the important point you are conveying
  with the diagnostic.
* Take advantage of location information. The user will be able to see the
  line and location of the caret, so you don't need to tell them that the
  problem is with the 4th argument to the function: just point to it.
* Do not capitalize the diagnostic string, and do not end it with a period.
* If you need to quote something in the diagnostic string, use single quotes.

Diagnostics should never take random English strings as arguments: you
shouldn't use "``you have a problem with %0``" and pass in things like "``your
argument``" or "``your return value``" as arguments. Doing this prevents
:ref:`translating <internals-diag-translation>` the Clang diagnostics to other
languages (because they'll get random English words in their otherwise
localized diagnostic). The exceptions to this are C/C++ language keywords
(e.g., ``auto``, ``const``, ``mutable``, etc.) and C/C++ operators (``/=``).
Note that things like "pointer" and "reference" are not keywords. On the other
hand, you *can* include anything that comes from the user's source code,
including variable names, types, labels, etc.
The "``select``" format can be
used to achieve this sort of thing in a localizable way; see below.

Formatting a Diagnostic Argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Arguments to diagnostics are fully typed internally, and come from a couple of
different classes: integers, types, names, and random strings. Depending on
the class of the argument, it can be optionally formatted in different ways.
This gives the ``DiagnosticConsumer`` information about what the argument means
without requiring it to use a specific presentation (consider this MVC for
Clang :).

It is really easy to add format specifiers to the Clang diagnostics system, but
they should be discussed before they are added. If you are creating a lot of
repetitive diagnostics and/or have an idea for a useful formatter, please bring
it up on the cfe-dev mailing list.

Here are the different diagnostic argument formats currently supported by
Clang:

**"s" format**

Example:
  ``"requires %0 parameter%s0"``
Class:
  Integers
Description:
  This is a simple formatter for integers that is useful when producing English
  diagnostics. When the integer is 1, it prints as nothing. When the integer
  is not 1, it prints as "``s``". This allows some simple grammatical forms to
  be handled correctly, and eliminates the need to use gross things like
  ``"requires %1 parameter(s)"``. Note, this only handles adding a simple
  "``s``" character, it will not handle situations where pluralization is more
  complicated such as turning ``fancy`` into ``fancies`` or ``mouse`` into
  ``mice``. You can use the "plural" format specifier to handle such situations.
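
As a rough illustration of the substitution rules described so far (``%%``,
``%0`` .. ``%9`` and the "s" format), here is a toy model in standalone C++.
It is a sketch under simplifying assumptions, not Clang's implementation: the
function name ``formatToyDiag`` is hypothetical, arguments are plain integers
rather than typed diagnostic arguments, and no other format specifiers are
recognized.

.. code-block:: c++

  #include <cassert>
  #include <cctype>
  #include <cstddef>
  #include <string>
  #include <vector>

  // Toy model only (hypothetical helper, not a Clang API): expands "%%",
  // "%N" and "%sN" in Fmt using integer arguments.
  std::string formatToyDiag(const std::string &Fmt,
                            const std::vector<int> &Args) {
    std::string Out;
    for (std::size_t I = 0; I < Fmt.size(); ++I) {
      if (Fmt[I] != '%') {
        Out += Fmt[I];
        continue;
      }
      assert(I + 1 < Fmt.size() && "dangling '%'");
      char Next = Fmt[++I];
      if (Next == '%') { // "%%" prints a literal percent sign.
        Out += '%';
        continue;
      }
      bool PluralS = false;
      if (Next == 's') { // "%sN": the "s" instruction precedes the digit.
        PluralS = true;
        assert(I + 1 < Fmt.size() && "missing argument index");
        Next = Fmt[++I];
      }
      assert(std::isdigit(static_cast<unsigned char>(Next)) &&
             "only %%, %N and %sN are modelled here");
      std::size_t Index = static_cast<std::size_t>(Next - '0');
      assert(Index < Args.size() && "argument index out of range");
      if (PluralS)
        Out += (Args[Index] == 1) ? "" : "s"; // nothing for 1, "s" otherwise
      else
        Out += std::to_string(Args[Index]);
    }
    return Out;
  }

With this sketch, ``formatToyDiag("requires %0 parameter%s0", {1})`` yields
``requires 1 parameter``, while passing ``{3}`` yields ``requires 3
parameters``. Because arguments are referenced by index, a format string such
as ``"%1 %0"`` prints them in swapped order.
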

**"select" format**

Example:
  ``"must be a %select{unary|binary|unary or binary}0 operator"``
Class:
  Integers
Description:
  This format specifier is used to merge multiple related diagnostics together
  into one common one, without requiring the difference to be specified as an
  English string argument. Instead of specifying the string, the diagnostic
  gets an integer argument and the format string selects the numbered option.
  In this case, the "``%0``" value must be an integer in the range [0..2]. If
  it is 0, it prints "unary"; if it is 1, it prints "binary"; if it is 2, it
  prints "unary or binary". This allows other language translations to
  substitute reasonable words (or entire phrases) based on the semantics of the
  diagnostic instead of having to do things textually. The selected string
  does undergo formatting.

**"plural" format**

Example:
  ``"you have %0 %plural{1:mouse|:mice}0 connected to your computer"``
Class:
  Integers
Description:
  This is a formatter for complex plural forms. It is designed to handle even
  the requirements of languages with very complex plural forms, as many Baltic
  languages have. The argument consists of a series of expression/form pairs
  separated by "|", with each expression separated from its form by ":". The
  first form whose expression evaluates to true is the result of the modifier.

  An expression can be empty, in which case it is always true. See the example
  at the top. Otherwise, it is a series of one or more numeric conditions,
  separated by ",". If any condition matches, the expression matches. Each
  numeric condition can take one of three forms.

  * number: A simple decimal number matches if the argument is the same as the
    number. Example: ``"%plural{1:mouse|:mice}0"``
  * range: A range in square brackets matches if the argument is within the
    range. The range is inclusive on both ends.
    Example:
    ``"%plural{0:none|1:one|[2,5]:some|:many}0"``
  * modulo: A modulo operator is followed by a number, an equals sign and
    either a number or a range. The tests are the same as for plain numbers
    and ranges, but the argument is taken modulo the number first. Example:
    ``"%plural{%100=0:even hundred|%100=[1,50]:lower half|:everything else}1"``

  The parser is very unforgiving. A syntax error, even whitespace, will abort,
  as will a failure to match the argument against any expression.

**"ordinal" format**

Example:
  ``"ambiguity in %ordinal0 argument"``
Class:
  Integers
Description:
  This is a formatter which represents the argument number as an ordinal: the
  value ``1`` becomes ``1st``, ``3`` becomes ``3rd``, and so on. Values less
  than ``1`` are not supported. This formatter is currently hard-coded to use
  English ordinals.

**"objcclass" format**

Example:
  ``"method %objcclass0 not found"``
Class:
  ``DeclarationName``
Description:
  This is a simple formatter that indicates the ``DeclarationName`` corresponds
  to an Objective-C class method selector. As such, it prints the selector
  with a leading "``+``".

**"objcinstance" format**

Example:
  ``"method %objcinstance0 not found"``
Class:
  ``DeclarationName``
Description:
  This is a simple formatter that indicates the ``DeclarationName`` corresponds
  to an Objective-C instance method selector. As such, it prints the selector
  with a leading "``-``".

**"q" format**

Example:
  ``"candidate found by name lookup is %q0"``
Class:
  ``NamedDecl *``
Description:
  This formatter indicates that the fully-qualified name of the declaration
  should be printed, e.g., "``std::vector``" rather than "``vector``".
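
The matching rules of the "plural" format and the behaviour of the "ordinal"
format described above can be modelled in a few lines of standalone C++. This
is only a sketch written from the prose description: the helper names
(``readNumber``, ``testValue``, ``evalPluralExpr``, ``selectPlural``,
``toyOrdinal``) are hypothetical and are not Clang's implementation. Like the
real parser, it is unforgiving and asserts on malformed input.

.. code-block:: c++

  #include <cassert>
  #include <cstddef>
  #include <string>

  // Reads a decimal number starting at Pos and advances Pos past it.
  unsigned readNumber(const std::string &S, std::size_t &Pos) {
    assert(Pos < S.size() && S[Pos] >= '0' && S[Pos] <= '9' && "expected digit");
    unsigned N = 0;
    while (Pos < S.size() && S[Pos] >= '0' && S[Pos] <= '9')
      N = N * 10 + static_cast<unsigned>(S[Pos++] - '0');
    return N;
  }

  // Tests Val against "number" or "[low,high]" and advances Pos past it.
  bool testValue(const std::string &S, std::size_t &Pos, unsigned Val) {
    if (S[Pos] != '[')
      return readNumber(S, Pos) == Val;
    ++Pos; // skip '['
    unsigned Low = readNumber(S, Pos);
    assert(S[Pos] == ',' && "expected ',' in range");
    ++Pos;
    unsigned High = readNumber(S, Pos);
    assert(S[Pos] == ']' && "expected ']' after range");
    ++Pos;
    return Low <= Val && Val <= High; // inclusive on both ends
  }

  // An empty expression is always true; otherwise any matching condition wins.
  bool evalPluralExpr(const std::string &Expr, unsigned Val) {
    if (Expr.empty())
      return true;
    bool Matched = false;
    std::size_t Pos = 0;
    while (Pos < Expr.size()) {
      unsigned Operand = Val;
      if (Expr[Pos] == '%') { // modulo form: "%N=..."
        ++Pos;
        unsigned Mod = readNumber(Expr, Pos);
        assert(Mod != 0 && Expr[Pos] == '=' && "expected '=' after modulus");
        ++Pos;
        Operand = Val % Mod;
      }
      Matched |= testValue(Expr, Pos, Operand);
      if (Pos < Expr.size()) {
        assert(Expr[Pos] == ',' && "conditions are separated by ','");
        ++Pos;
      }
    }
    return Matched;
  }

  // Returns the first form of "expr:form|expr:form|..." whose expr matches.
  std::string selectPlural(const std::string &Spec, unsigned Val) {
    std::size_t Begin = 0;
    while (Begin <= Spec.size()) {
      std::size_t End = Spec.find('|', Begin);
      if (End == std::string::npos)
        End = Spec.size();
      std::string Pair = Spec.substr(Begin, End - Begin);
      std::size_t Colon = Pair.find(':');
      assert(Colon != std::string::npos && "each pair needs a ':'");
      if (evalPluralExpr(Pair.substr(0, Colon), Val))
        return Pair.substr(Colon + 1);
      Begin = End + 1;
    }
    assert(false && "no expression matched the argument");
    return "";
  }

  // English ordinal suffixes; values below 1 are not supported.
  std::string toyOrdinal(unsigned N) {
    assert(N >= 1 && "values less than 1 are not supported");
    const char *Suffix = "th";
    unsigned LastTwo = N % 100;
    if (LastTwo < 11 || LastTwo > 13) {
      switch (N % 10) {
      case 1: Suffix = "st"; break;
      case 2: Suffix = "nd"; break;
      case 3: Suffix = "rd"; break;
      default: break;
      }
    }
    return std::to_string(N) + Suffix;
  }

For instance, ``selectPlural("0:none|1:one|[2,5]:some|:many", 4)`` selects
``some``, and with the modulo example from the list above an argument of 130
selects ``lower half`` because 130 modulo 100 falls in ``[1,50]``.
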

**"diff" format**

Example:
  ``"no known conversion %diff{from $ to $|from argument type to parameter type}1,2"``
Class:
  ``QualType``
Description:
  This formatter takes two ``QualType``\ s and attempts to print a template
  difference between the two. If tree printing is off, the text inside the
  braces before the pipe is printed, with the formatted text replacing the $.
  If tree printing is on, the text after the pipe is printed and a type tree is
  printed after the diagnostic message.

**"sub" format**

Example:
  Given the following record definition of type ``TextSubstitution``:

  .. code-block:: text

    def select_ovl_candidate : TextSubstitution<
      "%select{function|constructor}0%select{| template| %2}1">;

  which can be used as

  .. code-block:: text

    def note_ovl_candidate : Note<
      "candidate %sub{select_ovl_candidate}3,2,1 not viable">;

  and will act as if it was written
  ``"candidate %select{function|constructor}3%select{| template| %1}2 not viable"``.
Description:
  This format specifier is used to avoid repeating strings verbatim in multiple
  diagnostics. The argument to ``%sub`` must name a ``TextSubstitution`` tblgen
  record. The substitution must specify all arguments used by the substitution,
  and the modifier indexes in the substitution are re-numbered accordingly. The
  substituted text must itself be a valid format string before substitution.

.. _internals-producing-diag:

Producing the Diagnostic
^^^^^^^^^^^^^^^^^^^^^^^^

Now that you've created the diagnostic in the ``Diagnostic*Kinds.td`` file, you
need to write the code that detects the condition in question and emits the new
diagnostic. Various components of Clang (e.g., the preprocessor, ``Sema``,
etc.) provide a helper function named "``Diag``".
It creates a diagnostic and
accepts the arguments, ranges, and other information that goes along with it.

For example, the binary expression error comes from code like this:

.. code-block:: c++

  if (various things that are bad)
    Diag(Loc, diag::err_typecheck_invalid_operands)
      << lex->getType() << rex->getType()
      << lex->getSourceRange() << rex->getSourceRange();

This shows the use of the ``Diag`` method: it takes a location (a
:ref:`SourceLocation <SourceLocation>` object) and a diagnostic enum value
(which matches the name from ``Diagnostic*Kinds.td``). If the diagnostic takes
arguments, they are specified with the ``<<`` operator: the first argument
becomes ``%0``, the second becomes ``%1``, etc. The diagnostic interface
allows you to specify arguments of many different types, including ``int`` and
``unsigned`` for integer arguments, ``const char*`` and ``std::string`` for
string arguments, ``DeclarationName`` and ``const IdentifierInfo *`` for names,
``QualType`` for types, etc. ``SourceRange``\ s are also specified with the
``<<`` operator, but do not have a specific ordering requirement.

As you can see, adding and producing a diagnostic is pretty straightforward.
The hard part is deciding exactly what you need to say to help the user,
picking a suitable wording, and providing the information needed to format it
correctly. The good news is that the call site that issues a diagnostic should
be completely independent of how the diagnostic is formatted and in what
language it is rendered.

Fix-It Hints
^^^^^^^^^^^^

In some cases, the front end emits diagnostics when it is clear that some small
change to the source code would fix the problem. Examples include a missing
semicolon at the end of a statement or a use of deprecated syntax that is
easily rewritten into a more modern form.
Clang tries very hard to emit the
diagnostic and recover gracefully in these and other cases.

However, for these cases where the fix is obvious, the diagnostic can be
annotated with a hint (referred to as a "fix-it hint") that describes how to
change the code referenced by the diagnostic to fix the problem. For example,
it might add the missing semicolon at the end of the statement or rewrite the
use of a deprecated construct into something more palatable. Here is one such
example from the C++ front end, where we warn about the right-shift operator
changing meaning from C++98 to C++11:

.. code-block:: text

  test.cpp:3:7: warning: use of right-shift operator ('>>') in template argument
                will require parentheses in C++11
    A<100 >> 2> *a;
          ^
      (       )

Here, the fix-it hint is suggesting that parentheses be added, and showing
exactly where those parentheses would be inserted into the source code. The
fix-it hints themselves describe what changes to make to the source code in an
abstract manner, which the text diagnostic printer renders as a line of
"insertions" below the caret line. :ref:`Other diagnostic clients
<DiagnosticConsumer>` might choose to render the code differently (e.g., as
markup inline) or even give the user the ability to automatically fix the
problem.

Fix-it hints on errors and warnings need to obey these rules:

* Since they are automatically applied if ``-Xclang -fixit`` is passed to the
  driver, they should only be used when it's very likely they match the user's
  intent.
* Clang must recover from errors as if the fix-it had been applied.
* Fix-it hints on a warning must not change the meaning of the code.
  However, a hint may clarify the meaning as intentional, for example by adding
  parentheses when the precedence of operators isn't obvious.

If a fix-it can't obey these rules, put the fix-it on a note.
Fix-its on notes
are not applied automatically.

All fix-it hints are described by the ``FixItHint`` class, instances of which
should be attached to the diagnostic using the ``<<`` operator in the same way
that highlighted source ranges and arguments are passed to the diagnostic.
Fix-it hints can be created with one of three constructors:

* ``FixItHint::CreateInsertion(Loc, Code)``

    Specifies that the given ``Code`` (a string) should be inserted before the
    source location ``Loc``.

* ``FixItHint::CreateRemoval(Range)``

    Specifies that the code in the given source ``Range`` should be removed.

* ``FixItHint::CreateReplacement(Range, Code)``

    Specifies that the code in the given source ``Range`` should be removed,
    and replaced with the given ``Code`` string.

.. _DiagnosticConsumer:

The ``DiagnosticConsumer`` Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once code generates a diagnostic with all of the arguments and the rest of the
relevant information, Clang needs to know what to do with it. As previously
mentioned, the diagnostic machinery goes through some filtering to map a
severity onto a diagnostic level, then (assuming the diagnostic is not mapped
to "``Ignored``") it invokes an object that implements the
``DiagnosticConsumer`` interface with the information.

It is possible to implement this interface in many different ways. For
example, the normal Clang ``DiagnosticConsumer`` (named
``TextDiagnosticPrinter``) turns the arguments into strings (according to the
various formatting rules), prints out the file/line/column information and the
string, then prints out the line of code, the source ranges, and the caret.
However, this behavior isn't required.

Another implementation of the ``DiagnosticConsumer`` interface is the
``TextDiagnosticBuffer`` class, which is used when Clang is in ``-verify``
mode.
Instead of formatting and printing out the diagnostics, this
implementation just captures and remembers the diagnostics as they fly by.
Then ``-verify`` compares the list of produced diagnostics to the list of
expected ones. If they disagree, it prints out its own output. Full
documentation for the ``-verify`` mode can be found in the Clang API
documentation for `VerifyDiagnosticConsumer
</doxygen/classclang_1_1VerifyDiagnosticConsumer.html#details>`_.

There are many other possible implementations of this interface, and this is
why we prefer diagnostics to pass down rich structured information in
arguments. For example, an HTML output might want declaration names to be
linkified to where they come from in the source. Another example is that a GUI
might let you click on typedefs to expand them. This application would want to
pass significantly more information about types through to the GUI than a
simple flat string. The interface allows this to happen.

.. _internals-diag-translation:

Adding Translations to Clang
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Not possible yet! Diagnostic strings should be written in UTF-8; the client can
translate to the relevant code page if needed. Each translation completely
replaces the format string for the diagnostic.

.. _SourceLocation:
.. _SourceManager:

The ``SourceLocation`` and ``SourceManager`` classes
----------------------------------------------------

Strangely enough, the ``SourceLocation`` class represents a location within the
source code of the program. Important design points include:

#. ``sizeof(SourceLocation)`` must be extremely small, as these are embedded
   into many AST nodes and are passed around often. Currently it is 32 bits.
#. ``SourceLocation`` must be a simple value object that can be efficiently
   copied.
#. We should be able to represent a source location for any byte of any input
   file.
   This includes in the middle of tokens, in whitespace, in trigraphs,
   etc.
#. A ``SourceLocation`` must encode the current ``#include`` stack that was
   active when the location was processed. For example, if the location
   corresponds to a token, it should contain the set of ``#include``\ s active
   when the token was lexed. This allows us to print the ``#include`` stack
   for a diagnostic.
#. ``SourceLocation`` must be able to describe macro expansions, capturing both
   the ultimate instantiation point and the source of the original character
   data.

In practice, the ``SourceLocation`` works together with the ``SourceManager``
class to encode two pieces of information about a location: its spelling
location and its expansion location. For most tokens, these will be the
same. However, for a macro expansion (or tokens that came from a ``_Pragma``
directive) these will describe the location of the characters corresponding to
the token and the location where the token was used (i.e., the macro
expansion point or the location of the ``_Pragma`` itself).

The Clang front-end inherently depends on the location of a token being tracked
correctly. If it is ever incorrect, the front-end may get confused and die.
The reason for this is that the notion of the "spelling" of a ``Token`` in
Clang depends on being able to find the original input characters for the
token. This concept maps directly to the "spelling location" for the token.

``SourceRange`` and ``CharSourceRange``
---------------------------------------

.. mostly taken from https://discourse.llvm.org/t/code-ranges-of-tokens-ast-elements/16893/2

Clang represents most source ranges by [first, last], where "first" and "last"
each point to the beginning of their respective tokens. For example, consider
the ``SourceRange`` of the following statement:

.. code-block:: text

  x = foo + bar;
  ^first    ^last

To map from this representation to a character-based representation, the "last"
location needs to be adjusted to point to (or past) the end of that token with
either ``Lexer::MeasureTokenLength()`` or ``Lexer::getLocForEndOfToken()``. For
the rare cases where character-level source range information is needed we use
the ``CharSourceRange`` class.

The Driver Library
==================

The clang Driver and library are documented :doc:`here <DriverInternals>`.

Precompiled Headers
===================

Clang supports precompiled headers (:doc:`PCH <PCHInternals>`), which uses a
serialized representation of Clang's internal data structures, encoded with the
`LLVM bitstream format <https://llvm.org/docs/BitCodeFormat.html>`_.

The Frontend Library
====================

The Frontend library contains functionality useful for building tools on top of
the Clang libraries, for example several methods for outputting diagnostics.

Compiler Invocation
-------------------

One of the classes provided by the Frontend library is ``CompilerInvocation``,
which holds information that describes the current invocation of the Clang
``-cc1`` frontend. The information typically comes from the command line
constructed by the Clang driver or from clients performing custom
initialization. The data structure is split into logical units used by
different parts of the compiler, for example ``PreprocessorOptions``,
``LanguageOptions`` or ``CodeGenOptions``.

Command Line Interface
----------------------

The command line interface of the Clang ``-cc1`` frontend is defined alongside
the driver options in ``clang/Driver/Options.td``. The information making up an
option definition includes its prefix and name (for example ``-std=``), form and
position of the option value, help text, aliases and more.
Each option may
belong to a certain group and can be marked with zero or more flags. Options
accepted by the ``-cc1`` frontend are marked with the ``CC1Option`` flag.

Command Line Parsing
--------------------

Option definitions are processed by the ``-gen-opt-parser-defs`` tablegen
backend during early stages of the build. Options are then used for querying an
instance of ``llvm::opt::ArgList``, a wrapper around the command line arguments.
This is done in the Clang driver to construct individual jobs based on the
driver arguments and also in the ``CompilerInvocation::CreateFromArgs`` function
that parses the ``-cc1`` frontend arguments.

Command Line Generation
-----------------------

Any valid ``CompilerInvocation`` created from a ``-cc1`` command line can also
be serialized back into a semantically equivalent command line in a
deterministic manner. This enables features such as implicitly discovered,
explicitly built modules.

..
  TODO: Create and link corresponding section in Modules.rst.

Adding new Command Line Option
------------------------------

When adding a new command line option, the first place of interest is the header
file declaring the corresponding options class (e.g. ``CodeGenOptions.h`` for
a command line option that affects the code generation). Create a new member
variable for the option value:

.. code-block:: diff

    class CodeGenOptions : public CodeGenOptionsBase {

  +   /// List of dynamic shared object files to be loaded as pass plugins.
  +   std::vector<std::string> PassPlugins;

    }

Next, declare the command line interface of the option in the tablegen file
``clang/include/clang/Driver/Options.td``. This is done by instantiating the
``Option`` class (defined in ``llvm/include/llvm/Option/OptParser.td``).
The
instance is typically created through one of the helper classes that encode the
acceptable ways to specify the option value on the command line:

* ``Flag`` - the option does not accept any value,
* ``Joined`` - the value must immediately follow the option name within the same
  argument,
* ``Separate`` - the value must follow the option name in the next command line
  argument,
* ``JoinedOrSeparate`` - the value can be specified either as ``Joined`` or
  ``Separate``,
* ``CommaJoined`` - the values are comma-separated and must immediately follow
  the option name within the same argument (see ``Wl,`` for an example).

The helper classes take a list of acceptable prefixes of the option (e.g.
``"-"``, ``"--"`` or ``"/"``) and the option name:

.. code-block:: diff

    // Options.td

  + def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">;

Then, specify additional attributes via mix-ins:

* ``HelpText`` holds the text that will be printed beside the option name when
  the user requests help (e.g. via ``clang --help``).
* ``Group`` specifies the "category" of options this option belongs to. This is
  used by various tools to filter certain options of interest.
* ``Flags`` may contain a number of "tags" associated with the option. This
  enables more granular filtering than the ``Group`` attribute.
* ``Alias`` denotes that the option is an alias of another option. This may be
  combined with ``AliasArgs`` that holds the implied value.

.. code-block:: diff

    // Options.td

    def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">,
  +   Group<f_Group>, Flags<[CC1Option]>,
  +   HelpText<"Load pass plugin from a dynamic shared object file.">;

New options are recognized by the Clang driver unless marked with the
``NoDriverOption`` flag. On the other hand, options intended for the ``-cc1``
frontend must be explicitly marked with the ``CC1Option`` flag.
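
To make the difference between the value forms listed earlier in this section
(``Flag``, ``Joined``, ``Separate``, ``JoinedOrSeparate`` and ``CommaJoined``)
more concrete, here is a small standalone C++ model of how each form might
consume command line arguments. It is a sketch only: ``ToyKind``,
``ToyOption`` and ``matchToyOption`` are hypothetical names and do not
correspond to the ``llvm::opt`` API, and the option spelling is assumed to be
stored as prefix plus name in one string.

.. code-block:: c++

  #include <cassert>
  #include <cstddef>
  #include <string>
  #include <vector>

  // Hypothetical mirror of the tablegen helper classes; not llvm::opt.
  enum class ToyKind { Flag, Joined, Separate, JoinedOrSeparate, CommaJoined };

  struct ToyOption {
    std::string Spelling; // prefix and name, e.g. "-fpass-plugin="
    ToyKind Kind;
  };

  // Tries to match Opt at Argv[Index]. On success, appends the option's
  // values to Values, advances Index past the consumed arguments and returns
  // true. On failure, Index and Values are left untouched.
  bool matchToyOption(const ToyOption &Opt,
                      const std::vector<std::string> &Argv,
                      std::size_t &Index, std::vector<std::string> &Values) {
    assert(Index < Argv.size() && "no argument left to match");
    const std::string &Arg = Argv[Index];
    if (Arg.compare(0, Opt.Spelling.size(), Opt.Spelling) != 0)
      return false;
    std::string Rest = Arg.substr(Opt.Spelling.size());
    bool HasNext = Index + 1 < Argv.size();

    switch (Opt.Kind) {
    case ToyKind::Flag: // no value accepted
      if (!Rest.empty())
        return false;
      Index += 1;
      return true;
    case ToyKind::Joined: // value is the rest of the same argument
      Values.push_back(Rest);
      Index += 1;
      return true;
    case ToyKind::Separate: // value is the next argument
      if (!Rest.empty() || !HasNext)
        return false;
      Values.push_back(Argv[Index + 1]);
      Index += 2;
      return true;
    case ToyKind::JoinedOrSeparate: // either of the two forms above
      if (!Rest.empty()) {
        Values.push_back(Rest);
        Index += 1;
        return true;
      }
      if (!HasNext)
        return false;
      Values.push_back(Argv[Index + 1]);
      Index += 2;
      return true;
    case ToyKind::CommaJoined: { // comma-separated values in the same argument
      std::size_t Begin = 0;
      while (true) {
        std::size_t End = Rest.find(',', Begin);
        Values.push_back(Rest.substr(Begin, End - Begin));
        if (End == std::string::npos)
          break;
        Begin = End + 1;
      }
      Index += 1;
      return true;
    }
    }
    return false;
  }

In this model, ``-fpass-plugin=a`` matched as ``Joined`` consumes one argument
and yields the value ``a``, ``-o out`` matched as ``JoinedOrSeparate`` consumes
two arguments, and ``-Wl,x,y`` matched as ``CommaJoined`` yields the two values
``x`` and ``y``.
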
683 684Next, parse (or manufacture) the command line arguments in the Clang driver and 685use them to construct the ``-cc1`` job: 686 687.. code-block:: diff 688 689 void Clang::ConstructJob(const ArgList &Args /*...*/) const { 690 ArgStringList CmdArgs; 691 // ... 692 693 + for (const Arg *A : Args.filtered(OPT_fpass_plugin_EQ)) { 694 + CmdArgs.push_back(Args.MakeArgString(Twine("-fpass-plugin=") + A->getValue())); 695 + A->claim(); 696 + } 697 } 698 699The last step is implementing the ``-cc1`` command line argument 700parsing/generation that initializes/serializes the option class (in our case 701``CodeGenOptions``) stored within ``CompilerInvocation``. This can be done 702automatically by using the marshalling annotations on the option definition: 703 704.. code-block:: diff 705 706 // Options.td 707 708 def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">, 709 Group<f_Group>, Flags<[CC1Option]>, 710 HelpText<"Load pass plugin from a dynamic shared object file.">, 711 + MarshallingInfoStringVector<CodeGenOpts<"PassPlugins">>; 712 713Inner workings of the system are introduced in the :ref:`marshalling 714infrastructure <OptionMarshalling>` section and the available annotations are 715listed :ref:`here <OptionMarshallingAnnotations>`. 716 717In case the marshalling infrastructure does not support the desired semantics, 718consider simplifying it to fit the existing model. This makes the command line 719more uniform and reduces the amount of custom, manually written code. Remember 720that the ``-cc1`` command line interface is intended only for Clang developers, 721meaning it does not need to mirror the driver interface, maintain backward 722compatibility or be compatible with GCC. 723 724If the option semantics cannot be encoded via marshalling annotations, you can 725resort to parsing/serializing the command line arguments manually: 726 727.. 
code-block:: diff 728 729 // CompilerInvocation.cpp 730 731 static bool ParseCodeGenArgs(CodeGenOptions &Opts, ArgList &Args /*...*/) { 732 // ... 733 734 + Opts.PassPlugins = Args.getAllArgValues(OPT_fpass_plugin_EQ); 735 } 736 737 static void GenerateCodeGenArgs(const CodeGenOptions &Opts, 738 SmallVectorImpl<const char *> &Args, 739 CompilerInvocation::StringAllocator SA /*...*/) { 740 // ... 741 742 + for (const std::string &PassPlugin : Opts.PassPlugins) 743 + GenerateArg(Args, OPT_fpass_plugin_EQ, PassPlugin, SA); 744 } 745 746Finally, you can specify the argument on the command line: 747``clang -fpass-plugin=a -fpass-plugin=b`` and use the new member variable as 748desired. 749 750.. code-block:: diff 751 752 void EmitAssemblyHelper::EmitAssemblyWithNewPassManager(/*...*/) { 753 // ... 754 + for (auto &PluginFN : CodeGenOpts.PassPlugins) 755 + if (auto PassPlugin = PassPlugin::Load(PluginFN)) 756 + PassPlugin->registerPassBuilderCallbacks(PB); 757 } 758 759.. _OptionMarshalling: 760 761Option Marshalling Infrastructure 762--------------------------------- 763 764The option marshalling infrastructure automates the parsing of the Clang 765``-cc1`` frontend command line arguments into ``CompilerInvocation`` and their 766generation from ``CompilerInvocation``. The system replaces lots of repetitive 767C++ code with simple, declarative tablegen annotations and it's being used for 768the majority of the ``-cc1`` command line interface. This section provides an 769overview of the system. 770 771**Note:** The marshalling infrastructure is not intended for driver-only 772options. Only options of the ``-cc1`` frontend need to be marshalled to/from 773``CompilerInvocation`` instance. 774 775To read and modify contents of ``CompilerInvocation``, the marshalling system 776uses key paths, which are declared in two steps. First, a tablegen definition 777for the ``CompilerInvocation`` member is created by inheriting from 778``KeyPathAndMacro``: 779 780.. 
code-block:: text

  // Options.td

  class LangOpts<string field> : KeyPathAndMacro<"LangOpts->", field, "LANG_"> {}
  // CompilerInvocation member                  ^^^^^^^^^^
  // OPTION_WITH_MARSHALLING prefix                                    ^^^^^

The first argument to the parent class is the beginning of the key path that
references the ``CompilerInvocation`` member. This argument ends with ``->`` if
the member is a pointer type or with ``.`` if it's a value type. The child class
takes a single parameter ``field`` that is forwarded as the second argument to
the base class. The child class can then be used like so:
``LangOpts<"IgnoreExceptions">``, constructing a key path to the field
``LangOpts->IgnoreExceptions``. The third argument passed to the parent class is
a string that the tablegen backend uses as a prefix to the
``OPTION_WITH_MARSHALLING`` macro. Using the key path as a mix-in on an
``Option`` instance instructs the backend to generate the following code:

.. code-block:: c++

  // Options.inc

  #ifdef LANG_OPTION_WITH_MARSHALLING
  LANG_OPTION_WITH_MARSHALLING([...], LangOpts->IgnoreExceptions, [...])
  #endif // LANG_OPTION_WITH_MARSHALLING

Such a definition can be used in the functions for parsing and generating the
command line:

..
code-block:: c++

  // clang/lib/Frontend/CompilerInvocation.cpp

  bool CompilerInvocation::ParseLangArgs(LangOptions *LangOpts, ArgList &Args,
                                         DiagnosticsEngine &Diags) {
    bool Success = true;

  #define LANG_OPTION_WITH_MARSHALLING(                                          \
      PREFIX_TYPE, NAME, ID, KIND, GROUP, ALIAS, ALIASARGS, FLAGS, PARAM,        \
      HELPTEXT, METAVAR, VALUES, SPELLING, SHOULD_PARSE, ALWAYS_EMIT, KEYPATH,   \
      DEFAULT_VALUE, IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, DENORMALIZER,     \
      MERGER, EXTRACTOR, TABLE_INDEX)                                            \
    PARSE_OPTION_WITH_MARSHALLING(Args, Diags, Success, ID, FLAGS, PARAM,        \
                                  SHOULD_PARSE, KEYPATH, DEFAULT_VALUE,          \
                                  IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER,      \
                                  MERGER, TABLE_INDEX)
  #include "clang/Driver/Options.inc"
  #undef LANG_OPTION_WITH_MARSHALLING

    // ...

    return Success;
  }

  void CompilerInvocation::GenerateLangArgs(LangOptions *LangOpts,
                                            SmallVectorImpl<const char *> &Args,
                                            StringAllocator SA) {
  #define LANG_OPTION_WITH_MARSHALLING(                                          \
      PREFIX_TYPE, NAME, ID, KIND, GROUP, ALIAS, ALIASARGS, FLAGS, PARAM,        \
      HELPTEXT, METAVAR, VALUES, SPELLING, SHOULD_PARSE, ALWAYS_EMIT, KEYPATH,   \
      DEFAULT_VALUE, IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, DENORMALIZER,     \
      MERGER, EXTRACTOR, TABLE_INDEX)                                            \
    GENERATE_OPTION_WITH_MARSHALLING(                                            \
        Args, SA, KIND, FLAGS, SPELLING, ALWAYS_EMIT, KEYPATH, DEFAULT_VALUE,    \
        IMPLIED_CHECK, IMPLIED_VALUE, DENORMALIZER, EXTRACTOR, TABLE_INDEX)
  #include "clang/Driver/Options.inc"
  #undef LANG_OPTION_WITH_MARSHALLING

    // ...
  }

The ``PARSE_OPTION_WITH_MARSHALLING`` and ``GENERATE_OPTION_WITH_MARSHALLING``
macros are defined in ``CompilerInvocation.cpp`` and they implement the generic
algorithm for parsing and generating command line arguments.

..
_OptionMarshallingAnnotations:

Option Marshalling Annotations
------------------------------

How does the tablegen backend know what to put in place of ``[...]`` in the
generated ``Options.inc``? This is specified by the ``Marshalling`` utilities
described below. All of them take a key path argument and possibly other
information required for parsing or generating the command line argument.

**Note:** The marshalling infrastructure is not intended for driver-only
options. Only options of the ``-cc1`` frontend need to be marshalled to/from
``CompilerInvocation`` instance.

**Positive Flag**

The key path defaults to ``false`` and is set to ``true`` when the flag is
present on the command line.

.. code-block:: text

  def fignore_exceptions : Flag<["-"], "fignore-exceptions">, Flags<[CC1Option]>,
    MarshallingInfoFlag<LangOpts<"IgnoreExceptions">>;

**Negative Flag**

The key path defaults to ``true`` and is set to ``false`` when the flag is
present on the command line.

.. code-block:: text

  def fno_verbose_asm : Flag<["-"], "fno-verbose-asm">, Flags<[CC1Option]>,
    MarshallingInfoNegativeFlag<CodeGenOpts<"AsmVerbose">>;

**Negative and Positive Flag**

The key path defaults to the specified value (``false``, ``true`` or some
boolean value that's statically unknown in the tablegen file). Then, the key
path is set to the value associated with the flag that appears last on the
command line.

.. code-block:: text

  defm legacy_pass_manager : BoolOption<"f", "legacy-pass-manager",
    CodeGenOpts<"LegacyPassManager">, DefaultFalse,
    PosFlag<SetTrue, [], "Use the legacy pass manager in LLVM">,
    NegFlag<SetFalse, [], "Use the new pass manager in LLVM">,
    BothFlags<[CC1Option]>>;

With most such pairs of flags, the ``-cc1`` frontend accepts only the flag that
changes the default key path value.
The Clang driver is responsible for 907accepting both and either forwarding the changing flag or discarding the flag 908that would just set the key path to its default. 909 910The first argument to ``BoolOption`` is a prefix that is used to construct the 911full names of both flags. The positive flag would then be named 912``flegacy-pass-manager`` and the negative ``fno-legacy-pass-manager``. 913``BoolOption`` also implies the ``-`` prefix for both flags. It's also possible 914to use ``BoolFOption`` that implies the ``"f"`` prefix and ``Group<f_Group>``. 915The ``PosFlag`` and ``NegFlag`` classes hold the associated boolean value, an 916array of elements passed to the ``Flag`` class and the help text. The optional 917``BothFlags`` class holds an array of ``Flag`` elements that are common for both 918the positive and negative flag and their common help text suffix. 919 920**String** 921 922The key path defaults to the specified string, or an empty one, if omitted. When 923the option appears on the command line, the argument value is simply copied. 924 925.. code-block:: text 926 927 def isysroot : JoinedOrSeparate<["-"], "isysroot">, Flags<[CC1Option]>, 928 MarshallingInfoString<HeaderSearchOpts<"Sysroot">, [{"/"}]>; 929 930**List of Strings** 931 932The key path defaults to an empty ``std::vector<std::string>``. Values specified 933with each appearance of the option on the command line are appended to the 934vector. 935 936.. code-block:: text 937 938 def frewrite_map_file : Separate<["-"], "frewrite-map-file">, Flags<[CC1Option]>, 939 MarshallingInfoStringVector<CodeGenOpts<"RewriteMapFiles">>; 940 941**Integer** 942 943The key path defaults to the specified integer value, or ``0`` if omitted. When 944the option appears on the command line, its value gets parsed by ``llvm::APInt`` 945and the result is assigned to the key path on success. 946 947.. 
code-block:: text

  def mstack_probe_size : Joined<["-"], "mstack-probe-size=">, Flags<[CC1Option]>,
    MarshallingInfoInt<CodeGenOpts<"StackProbeSize">, "4096">;

**Enumeration**

The key path defaults to the value specified in ``MarshallingInfoEnum`` prefixed
by the contents of ``NormalizedValuesScope`` and ``::``. This ensures a correct
reference to an enum case is formed even if the enum resides in a different
namespace or is an enum class. If the value present on the command line does not
match any of the comma-separated values from ``Values``, an error diagnostic is
issued. Otherwise, the corresponding element from ``NormalizedValues`` at the
same index is assigned to the key path (also correctly scoped). The number of
comma-separated string values and elements of the array within
``NormalizedValues`` must match.

.. code-block:: text

  def mthread_model : Separate<["-"], "mthread-model">, Flags<[CC1Option]>,
    Values<"posix,single">, NormalizedValues<["POSIX", "Single"]>,
    NormalizedValuesScope<"LangOptions::ThreadModelKind">,
    MarshallingInfoEnum<LangOpts<"ThreadModel">, "POSIX">;

..
  Intentionally omitting MarshallingInfoBitfieldFlag. It's adding some
  complexity to the marshalling infrastructure and might be removed.

It is also possible to define relationships between options.

**Implication**

The key path defaults to the default value from the primary ``Marshalling``
annotation. Then, if any of the elements of ``ImpliedByAnyOf`` evaluate to true,
the key path value is changed to the specified value or ``true`` if missing.
Finally, the command line is parsed according to the primary annotation.

..
code-block:: text

  def fms_extensions : Flag<["-"], "fms-extensions">, Flags<[CC1Option]>,
    MarshallingInfoFlag<LangOpts<"MicrosoftExt">>,
    ImpliedByAnyOf<[fms_compatibility.KeyPath], "true">;

**Condition**

The option is parsed only if the expression in ``ShouldParseIf`` evaluates to
true.

.. code-block:: text

  def fopenmp_enable_irbuilder : Flag<["-"], "fopenmp-enable-irbuilder">, Flags<[CC1Option]>,
    MarshallingInfoFlag<LangOpts<"OpenMPIRBuilder">>,
    ShouldParseIf<fopenmp.KeyPath>;

The Lexer and Preprocessor Library
==================================

The Lexer library contains several tightly-connected classes that are involved
with the nasty process of lexing and preprocessing C source code. The main
interface to this library for outside clients is the large ``Preprocessor``
class. It contains the various pieces of state that are required to coherently
read tokens out of a translation unit.

The core interface to the ``Preprocessor`` object (once it is set up) is the
``Preprocessor::Lex`` method, which returns the next :ref:`Token <Token>` from
the preprocessor stream. There are two types of token providers that the
preprocessor is capable of reading from: a buffer lexer (provided by the
:ref:`Lexer <Lexer>` class) and a buffered token stream (provided by the
:ref:`TokenLexer <TokenLexer>` class).

.. _Token:

The Token class
---------------

The ``Token`` class is used to represent a single lexed token. Tokens are
intended to be used by the lexer/preprocessor and parser libraries, but are not
intended to live beyond them (for example, they should not live in the ASTs).

Tokens most often live on the stack (or some other location that is efficient
to access) as the parser is running, but occasionally do get buffered up.
For 1028example, macro definitions are stored as a series of tokens, and the C++ 1029front-end periodically needs to buffer tokens up for tentative parsing and 1030various pieces of look-ahead. As such, the size of a ``Token`` matters. On a 103132-bit system, ``sizeof(Token)`` is currently 16 bytes. 1032 1033Tokens occur in two forms: :ref:`annotation tokens <AnnotationToken>` and 1034normal tokens. Normal tokens are those returned by the lexer, annotation 1035tokens represent semantic information and are produced by the parser, replacing 1036normal tokens in the token stream. Normal tokens contain the following 1037information: 1038 1039* **A SourceLocation** --- This indicates the location of the start of the 1040 token. 1041 1042* **A length** --- This stores the length of the token as stored in the 1043 ``SourceBuffer``. For tokens that include them, this length includes 1044 trigraphs and escaped newlines which are ignored by later phases of the 1045 compiler. By pointing into the original source buffer, it is always possible 1046 to get the original spelling of a token completely accurately. 1047 1048* **IdentifierInfo** --- If a token takes the form of an identifier, and if 1049 identifier lookup was enabled when the token was lexed (e.g., the lexer was 1050 not reading in "raw" mode) this contains a pointer to the unique hash value 1051 for the identifier. Because the lookup happens before keyword 1052 identification, this field is set even for language keywords like "``for``". 1053 1054* **TokenKind** --- This indicates the kind of token as classified by the 1055 lexer. This includes things like ``tok::starequal`` (for the "``*=``" 1056 operator), ``tok::ampamp`` for the "``&&``" token, and keyword values (e.g., 1057 ``tok::kw_for``) for identifiers that correspond to keywords. Note that 1058 some tokens can be spelled multiple ways. 
For example, C++ supports 1059 "operator keywords", where things like "``and``" are treated exactly like the 1060 "``&&``" operator. In these cases, the kind value is set to ``tok::ampamp``, 1061 which is good for the parser, which doesn't have to consider both forms. For 1062 something that cares about which form is used (e.g., the preprocessor 1063 "stringize" operator) the spelling indicates the original form. 1064 1065* **Flags** --- There are currently four flags tracked by the 1066 lexer/preprocessor system on a per-token basis: 1067 1068 #. **StartOfLine** --- This was the first token that occurred on its input 1069 source line. 1070 #. **LeadingSpace** --- There was a space character either immediately before 1071 the token or transitively before the token as it was expanded through a 1072 macro. The definition of this flag is very closely defined by the 1073 stringizing requirements of the preprocessor. 1074 #. **DisableExpand** --- This flag is used internally to the preprocessor to 1075 represent identifier tokens which have macro expansion disabled. This 1076 prevents them from being considered as candidates for macro expansion ever 1077 in the future. 1078 #. **NeedsCleaning** --- This flag is set if the original spelling for the 1079 token includes a trigraph or escaped newline. Since this is uncommon, 1080 many pieces of code can fast-path on tokens that did not need cleaning. 1081 1082One interesting (and somewhat unusual) aspect of normal tokens is that they 1083don't contain any semantic information about the lexed value. For example, if 1084the token was a pp-number token, we do not represent the value of the number 1085that was lexed (this is left for later pieces of code to decide). 
1086Additionally, the lexer library has no notion of typedef names vs variable 1087names: both are returned as identifiers, and the parser is left to decide 1088whether a specific identifier is a typedef or a variable (tracking this 1089requires scope information among other things). The parser can do this 1090translation by replacing tokens returned by the preprocessor with "Annotation 1091Tokens". 1092 1093.. _AnnotationToken: 1094 1095Annotation Tokens 1096----------------- 1097 1098Annotation tokens are tokens that are synthesized by the parser and injected 1099into the preprocessor's token stream (replacing existing tokens) to record 1100semantic information found by the parser. For example, if "``foo``" is found 1101to be a typedef, the "``foo``" ``tok::identifier`` token is replaced with an 1102``tok::annot_typename``. This is useful for a couple of reasons: 1) this makes 1103it easy to handle qualified type names (e.g., "``foo::bar::baz<42>::t``") in 1104C++ as a single "token" in the parser. 2) if the parser backtracks, the 1105reparse does not need to redo semantic analysis to determine whether a token 1106sequence is a variable, type, template, etc. 1107 1108Annotation tokens are created by the parser and reinjected into the parser's 1109token stream (when backtracking is enabled). Because they can only exist in 1110tokens that the preprocessor-proper is done with, it doesn't need to keep 1111around flags like "start of line" that the preprocessor uses to do its job. 1112Additionally, an annotation token may "cover" a sequence of preprocessor tokens 1113(e.g., "``a::b::c``" is five preprocessor tokens). As such, the valid fields 1114of an annotation token are different than the fields for a normal token (but 1115they are multiplexed into the normal ``Token`` fields): 1116 1117* **SourceLocation "Location"** --- The ``SourceLocation`` for the annotation 1118 token indicates the first token replaced by the annotation token. 
In the 1119 example above, it would be the location of the "``a``" identifier. 1120* **SourceLocation "AnnotationEndLoc"** --- This holds the location of the last 1121 token replaced with the annotation token. In the example above, it would be 1122 the location of the "``c``" identifier. 1123* **void* "AnnotationValue"** --- This contains an opaque object that the 1124 parser gets from ``Sema``. The parser merely preserves the information for 1125 ``Sema`` to later interpret based on the annotation token kind. 1126* **TokenKind "Kind"** --- This indicates the kind of Annotation token this is. 1127 See below for the different valid kinds. 1128 1129Annotation tokens currently come in three kinds: 1130 1131#. **tok::annot_typename**: This annotation token represents a resolved 1132 typename token that is potentially qualified. The ``AnnotationValue`` field 1133 contains the ``QualType`` returned by ``Sema::getTypeName()``, possibly with 1134 source location information attached. 1135#. **tok::annot_cxxscope**: This annotation token represents a C++ scope 1136 specifier, such as "``A::B::``". This corresponds to the grammar 1137 productions "*::*" and "*:: [opt] nested-name-specifier*". The 1138 ``AnnotationValue`` pointer is a ``NestedNameSpecifier *`` returned by the 1139 ``Sema::ActOnCXXGlobalScopeSpecifier`` and 1140 ``Sema::ActOnCXXNestedNameSpecifier`` callbacks. 1141#. **tok::annot_template_id**: This annotation token represents a C++ 1142 template-id such as "``foo<int, 4>``", where "``foo``" is the name of a 1143 template. The ``AnnotationValue`` pointer is a pointer to a ``malloc``'d 1144 ``TemplateIdAnnotation`` object. 
   Depending on the context, a parsed
   template-id that names a type might become a typename annotation token (if
   all we care about is the named type, e.g., because it occurs in a type
   specifier) or might remain a template-id token (if we want to retain more
   source location information or produce a new type, e.g., in a declaration of
   a class template specialization). Template-id annotation tokens that refer
   to a type can be "upgraded" to typename annotation tokens by the parser.

As mentioned above, annotation tokens are not returned by the preprocessor,
they are formed on demand by the parser. This means that the parser has to be
aware of cases where an annotation could occur and form it where appropriate.
This is somewhat similar to how the parser handles Translation Phase 6 of C99:
String Concatenation (see C99 5.1.1.2). In the case of string concatenation,
the preprocessor just returns distinct ``tok::string_literal`` and
``tok::wide_string_literal`` tokens and the parser eats a sequence of them
wherever the grammar indicates that a string literal can occur.

In order to do this, whenever the parser expects a ``tok::identifier`` or
``tok::coloncolon``, it should call the ``TryAnnotateTypeOrScopeToken`` or
``TryAnnotateCXXScopeToken`` methods to form the annotation token. These
methods will maximally form the specified annotation tokens and replace the
current token with them, if applicable. If the current token is not valid for
an annotation token, it will remain an identifier or "``::``" token.

.. _Lexer:

The ``Lexer`` class
-------------------

The ``Lexer`` class provides the mechanics of lexing tokens out of a source
buffer and deciding what they mean.
The ``Lexer`` is complicated by the fact
that it operates on raw buffers that have not had spelling eliminated (this is
a necessity to get decent performance), but this is countered with careful
coding as well as standard performance techniques (for example, the comment
handling code is vectorized on X86 and PowerPC hosts).

The lexer has a couple of interesting modal features:

* The lexer can operate in "raw" mode. This mode has several features that
  make it possible to quickly lex the file (e.g., it stops identifier lookup,
  doesn't specially handle preprocessor tokens, handles EOF differently, etc).
  This mode is used for lexing within an "``#if 0``" block, for example.
* The lexer can capture and return comments as tokens. This is required to
  support the ``-C`` preprocessor mode, which passes comments through, and is
  used by the diagnostic checker to identify expect-error annotations.
* The lexer can be in ``ParsingFilename`` mode, which happens when
  preprocessing after reading a ``#include`` directive. This mode changes the
  parsing of "``<``" to return an "angled string" instead of a bunch of tokens
  for each thing within the filename.
* When parsing a preprocessor directive (after "``#``") the
  ``ParsingPreprocessorDirective`` mode is entered. This changes the parser to
  return EOD at a newline.
* The ``Lexer`` uses a ``LangOptions`` object to know whether trigraphs are
  enabled, whether C++ or ObjC keywords are recognized, etc.

In addition to these modes, the lexer keeps track of a couple of other features
that are local to a lexed buffer, which change as the buffer is lexed:

* The ``Lexer`` uses ``BufferPtr`` to keep track of the current character being
  lexed.
* The ``Lexer`` uses ``IsAtStartOfLine`` to keep track of whether the next
  lexed token will start with its "start of line" bit set.
* The ``Lexer`` keeps track of the current "``#if``" directives that are active
  (which can be nested).
* The ``Lexer`` keeps track of a :ref:`MultipleIncludeOpt
  <MultipleIncludeOpt>` object, which is used to detect whether the buffer uses
  the standard "``#ifndef XX`` / ``#define XX``" idiom to prevent multiple
  inclusion. If a buffer does, subsequent includes can be ignored if the
  "``XX``" macro is defined.

.. _TokenLexer:

The ``TokenLexer`` class
------------------------

The ``TokenLexer`` class is a token provider that returns tokens from a list of
tokens that came from somewhere else. It is typically used for two things: 1)
returning tokens from a macro definition as it is being expanded 2) returning
tokens from an arbitrary buffer of tokens. The latter is used by
``_Pragma`` and will most likely be used to handle unbounded look-ahead for the
C++ parser.

.. _MultipleIncludeOpt:

The ``MultipleIncludeOpt`` class
--------------------------------

The ``MultipleIncludeOpt`` class implements a really simple little state
machine that is used to detect the standard "``#ifndef XX`` / ``#define XX``"
idiom that people typically use to prevent multiple inclusion of headers. If a
buffer uses this idiom and is subsequently ``#include``'d, the preprocessor can
simply check to see whether the guarding condition is defined or not. If so,
the preprocessor can completely ignore the include of the header.

.. _Parser:

The Parser Library
==================

This library contains a recursive-descent parser that polls tokens from the
preprocessor and notifies a client of the parsing progress.

Historically, the parser used to talk to an abstract ``Action`` interface that
had virtual methods for parse events, for example ``ActOnBinOp()``.
When Clang 1248grew C++ support, the parser stopped supporting general ``Action`` clients -- 1249it now always talks to the :ref:`Sema library <Sema>`. However, the Parser 1250still accesses AST objects only through opaque types like ``ExprResult`` and 1251``StmtResult``. Only :ref:`Sema <Sema>` looks at the AST node contents of these 1252wrappers. 1253 1254.. _AST: 1255 1256The AST Library 1257=============== 1258 1259.. _ASTPhilosophy: 1260 1261Design philosophy 1262----------------- 1263 1264Immutability 1265^^^^^^^^^^^^ 1266 1267Clang AST nodes (types, declarations, statements, expressions, and so on) are 1268generally designed to be immutable once created. This provides a number of key 1269benefits: 1270 1271 * Canonicalization of the "meaning" of nodes is possible as soon as the nodes 1272 are created, and is not invalidated by later addition of more information. 1273 For example, we :ref:`canonicalize types <CanonicalType>`, and use a 1274 canonicalized representation of expressions when determining whether two 1275 function template declarations involving dependent expressions declare the 1276 same entity. 1277 * AST nodes can be reused when they have the same meaning. For example, we 1278 reuse ``Type`` nodes when representing the same type (but maintain separate 1279 ``TypeLoc``\s for each instance where a type is written), and we reuse 1280 non-dependent ``Stmt`` and ``Expr`` nodes across instantiations of a 1281 template. 1282 * Serialization and deserialization of the AST to/from AST files is simpler: 1283 we do not need to track modifications made to AST nodes imported from AST 1284 files and serialize separate "update records". 1285 1286There are unfortunately exceptions to this general approach, such as: 1287 1288 * The first declaration of a redeclarable entity maintains a pointer to the 1289 most recent declaration of that entity, which naturally needs to change as 1290 more declarations are parsed. 
1291 * Name lookup tables in declaration contexts change after the namespace 1292 declaration is formed. 1293 * We attempt to maintain only a single declaration for an instantiation of a 1294 template, rather than having distinct declarations for an instantiation of 1295 the declaration versus the definition, so template instantiation often 1296 updates parts of existing declarations. 1297 * Some parts of declarations are required to be instantiated separately (this 1298 includes default arguments and exception specifications), and such 1299 instantiations update the existing declaration. 1300 1301These cases tend to be fragile; mutable AST state should be avoided where 1302possible. 1303 1304As a consequence of this design principle, we typically do not provide setters 1305for AST state. (Some are provided for short-term modifications intended to be 1306used immediately after an AST node is created and before it's "published" as 1307part of the complete AST, or where language semantics require after-the-fact 1308updates.) 1309 1310Faithfulness 1311^^^^^^^^^^^^ 1312 1313The AST intends to provide a representation of the program that is faithful to 1314the original source. We intend for it to be possible to write refactoring tools 1315using only information stored in, or easily reconstructible from, the Clang AST. 1316This means that the AST representation should either not desugar source-level 1317constructs to simpler forms, or -- where made necessary by language semantics 1318or a clear engineering tradeoff -- should desugar minimally and wrap the result 1319in a construct representing the original source form. 1320 1321For example, ``CXXForRangeStmt`` directly represents the syntactic form of a 1322range-based for statement, but also holds a semantic representation of the 1323range declaration and iterator declarations. It does not contain a 1324fully-desugared ``ForStmt``, however. 
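
To make the trade-off concrete, the sketch below contrasts a range-based for
statement as written with the expansion the C++ standard defines for it. The
function and variable names are hypothetical; the point is only that the first
form is what a refactoring tool needs to see, while the declarations in the
second form are the kind of semantic information that ``CXXForRangeStmt``
keeps alongside it:

```cpp
#include <vector>

// Source form: the syntax the user wrote.
int sumRangeFor(const std::vector<int> &Values) {
  int Sum = 0;
  for (int V : Values)
    Sum += V;
  return Sum;
}

// Equivalent expansion following the C++ rules for range-based for.
// The names Range, Begin and End are illustrative stand-ins for the
// implicit range and iterator declarations.
int sumDesugared(const std::vector<int> &Values) {
  int Sum = 0;
  {
    auto &&Range = Values;      // range declaration
    auto Begin = Range.begin(); // begin iterator declaration
    auto End = Range.end();     // end iterator declaration
    for (; Begin != End; ++Begin) {
      int V = *Begin;           // loop variable declaration
      Sum += V;
    }
  }
  return Sum;
}
```

Both functions compute the same result. Storing only the second form would
lose the information that the user wrote a range-based for statement.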
1325 1326Some AST nodes (for example, ``ParenExpr``) represent only syntax, and others 1327(for example, ``ImplicitCastExpr``) represent only semantics, but most nodes 1328will represent a combination of syntax and associated semantics. Inheritance 1329is typically used when representing different (but related) syntaxes for nodes 1330with the same or similar semantics. 1331 1332.. _Type: 1333 1334The ``Type`` class and its subclasses 1335------------------------------------- 1336 1337The ``Type`` class (and its subclasses) are an important part of the AST. 1338Types are accessed through the ``ASTContext`` class, which implicitly creates 1339and uniques them as they are needed. Types have a couple of non-obvious 1340features: 1) they do not capture type qualifiers like ``const`` or ``volatile`` 1341(see :ref:`QualType <QualType>`), and 2) they implicitly capture typedef 1342information. Once created, types are immutable (unlike decls). 1343 1344Typedefs in C make semantic analysis a bit more complex than it would be without 1345them. The issue is that we want to capture typedef information and represent it 1346in the AST perfectly, but the semantics of operations need to "see through" 1347typedefs. For example, consider this code: 1348 1349.. code-block:: c++ 1350 1351 void func() { 1352 typedef int foo; 1353 foo X, *Y; 1354 typedef foo *bar; 1355 bar Z; 1356 *X; // error 1357 **Y; // error 1358 **Z; // error 1359 } 1360 1361The code above is illegal, and thus we expect there to be diagnostics emitted 1362on the annotated lines. In this example, we expect to get: 1363 1364.. 
code-block:: text 1365 1366 test.c:6:1: error: indirection requires pointer operand ('foo' invalid) 1367 *X; // error 1368 ^~ 1369 test.c:7:1: error: indirection requires pointer operand ('foo' invalid) 1370 **Y; // error 1371 ^~~ 1372 test.c:8:1: error: indirection requires pointer operand ('foo' invalid) 1373 **Z; // error 1374 ^~~ 1375 1376While this example is somewhat silly, it illustrates the point: we want to 1377retain typedef information where possible, so that we can emit errors about 1378"``std::string``" instead of "``std::basic_string<char, std:...``". Doing this 1379requires properly keeping typedef information (for example, the type of ``X`` 1380is "``foo``", not "``int``"), and requires properly propagating it through the 1381various operators (for example, the type of ``*Y`` is "``foo``", not 1382"``int``"). In order to retain this information, the type of these expressions 1383is an instance of the ``TypedefType`` class, which indicates that the type of 1384these expressions is a typedef for "``foo``". 1385 1386Representing types like this is great for diagnostics, because the 1387user-specified type is always immediately available. There are two problems 1388with this: first, various semantic checks need to make judgements about the 1389*actual structure* of a type, ignoring typedefs. Second, we need an efficient 1390way to query whether two types are structurally identical to each other, 1391ignoring typedefs. The solution to both of these problems is the idea of 1392canonical types. 1393 1394.. _CanonicalType: 1395 1396Canonical Types 1397^^^^^^^^^^^^^^^ 1398 1399Every instance of the ``Type`` class contains a canonical type pointer. For 1400simple types with no typedefs involved (e.g., "``int``", "``int*``", 1401"``int**``"), the type just points to itself. 
For types that have a typedef
somewhere in their structure (e.g., "``foo``", "``foo*``", "``foo**``",
"``bar``"), the canonical type pointer points to their structurally equivalent
type without any typedefs (e.g., "``int``", "``int*``", "``int**``", and
"``int*``" respectively).

This design provides a constant time operation (dereferencing the canonical type
pointer) that gives us access to the structure of types. For example, we can
trivially tell that "``bar``" and "``foo*``" are the same type by dereferencing
their canonical type pointers and doing a pointer comparison (they both point
to the single "``int*``" type).

Canonical types and typedef types bring up some complexities that must be
carefully managed. Specifically, the ``isa``/``cast``/``dyn_cast`` operators
generally shouldn't be applied directly to types in code that is inspecting the
AST. For example, when type checking the indirection operator (unary "``*``" on
a pointer), the type checker must verify that the operand has a pointer type.
It would not be correct to check that with
"``isa<PointerType>(SubExpr->getType())``", because this predicate would fail
if the subexpression had a typedef type.

The solution to this problem is a set of helper methods on ``Type``, used to
check its properties. In this case, it would be correct to use
"``SubExpr->getType()->isPointerType()``" to do the check. This predicate will
return true if the *canonical type is a pointer*, which is true any time the
type is structurally a pointer type. The only hard part here is remembering
not to use the ``isa``/``cast``/``dyn_cast`` operations.

The second problem we face is how to get access to the pointer type once we
know it exists. To continue the example, the result type of the indirection
operator is the pointee type of the subexpression.
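
The canonical type pointer and the helper-predicate idea described so far can
be sketched with a small toy model. This is not Clang's implementation:
``ToyType``, ``ToyContext``, ``isLiterallyPointer``, and ``sameStructure`` are
hypothetical names, and only builtins, pointers, and typedefs are modelled.

.. code-block:: c++

  #include <deque>
  #include <string>

  // Toy stand-ins for clang::Type; every name here is hypothetical.
  enum class Kind { Builtin, Pointer, Typedef };

  struct ToyType {
    Kind K;
    std::string Name;          // spelling for builtins and typedefs
    const ToyType *Inner;      // pointee (Pointer) or aliased type (Typedef)
    const ToyType *Canonical;  // structurally equivalent type without typedefs

    // Analogue of isa<PointerType>(T): looks only at the outermost node.
    bool isLiterallyPointer() const { return K == Kind::Pointer; }
    // Analogue of T->isPointerType(): looks at the canonical type instead.
    bool isPointerType() const { return Canonical->K == Kind::Pointer; }
  };

  // Creates and uniques types, loosely mirroring the role of ASTContext.
  class ToyContext {
    std::deque<ToyType> Types;  // push_back keeps element addresses stable

  public:
    const ToyType *getBuiltin(const std::string &Name) {
      for (const ToyType &T : Types)
        if (T.K == Kind::Builtin && T.Name == Name)
          return &T;
      Types.push_back({Kind::Builtin, Name, nullptr, nullptr});
      Types.back().Canonical = &Types.back();  // simple types point to themselves
      return &Types.back();
    }

    const ToyType *getPointer(const ToyType *Pointee) {
      for (const ToyType &T : Types)
        if (T.K == Kind::Pointer && T.Inner == Pointee)
          return &T;
      Types.push_back({Kind::Pointer, "", Pointee, nullptr});
      ToyType &New = Types.back();
      // A pointer to a canonical type is canonical itself; otherwise the
      // canonical type is the pointer to the pointee's canonical type.
      New.Canonical = (Pointee->Canonical == Pointee)
                          ? &New
                          : getPointer(Pointee->Canonical);
      return &New;
    }

    const ToyType *getTypedef(const std::string &Name, const ToyType *Aliased) {
      Types.push_back({Kind::Typedef, Name, Aliased, Aliased->Canonical});
      return &Types.back();
    }
  };

  // Constant-time structural comparison: compare canonical type pointers.
  inline bool sameStructure(const ToyType *A, const ToyType *B) {
    return A->Canonical == B->Canonical;
  }

In this model, with ``foo`` a typedef of ``int`` and ``bar`` a typedef of
``foo*``, the ``bar`` node is not literally a pointer node, yet
``isPointerType()`` holds for it, and comparing ``bar`` with ``foo*`` is a
single pointer comparison.
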
In order to determine the
type, we need to get the instance of ``PointerType`` that best captures the
typedef information in the program. If the type of the expression is literally
a ``PointerType``, we can return that, otherwise we have to dig through the
typedefs to find the pointer type. For example, if the subexpression had type
"``foo*``", we could return that type as the result. If the subexpression had
type "``bar``", we want to return "``foo*``" (note that we do *not* want
"``int*``"). In order to provide all of this, ``Type`` has a
``getAsPointerType()`` method that checks whether the type is structurally a
``PointerType`` and, if so, returns the best one. If not, it returns a null
pointer.

This structure is somewhat mystical, but after meditating on it, it will make
sense to you :).

.. _QualType:

The ``QualType`` class
----------------------

The ``QualType`` class is designed as a trivial value class that is small,
passed by-value and is efficient to query. The idea of ``QualType`` is that it
stores the type qualifiers (``const``, ``volatile``, ``restrict``, plus some
extended qualifiers required by language extensions) separately from the types
themselves. ``QualType`` is conceptually a pair of "``Type*``" and the bits
for these type qualifiers.

By storing the type qualifiers as bits in the conceptual pair, it is extremely
efficient to get the set of qualifiers on a ``QualType`` (just return the field
of the pair), add a type qualifier (which is a trivial constant-time operation
that sets a bit), and remove one or more type qualifiers (just return a
``QualType`` with the bitfield set to empty).

Further, because the bits are stored outside of the type itself, we do not need
to create duplicates of types with different sets of qualifiers (i.e.
there is
only a single heap allocated "``int``" type: "``const int``" and "``volatile
const int``" both point to the same heap allocated "``int``" type). This
reduces the heap size used to represent bits and also means we do not have to
consider qualifiers when uniquing types (:ref:`Type <Type>` does not even
contain qualifiers).

In practice, the two most common type qualifiers (``const`` and ``restrict``)
are stored in the low bits of the pointer to the ``Type`` object, together with
a flag indicating whether extended qualifiers are present (which must be
heap-allocated). This means that ``QualType`` is exactly the same size as a
pointer.

.. _DeclarationName:

Declaration names
-----------------

The ``DeclarationName`` class represents the name of a declaration in Clang.
Declarations in the C family of languages can take several different forms.
Most declarations are named by simple identifiers, e.g., "``f``" and "``x``" in
the function declaration ``f(int x)``. In C++, declaration names can also name
class constructors ("``Class``" in ``struct Class { Class(); }``), class
destructors ("``~Class``"), overloaded operator names ("``operator+``"), and
conversion functions ("``operator void const *``"). In Objective-C,
declaration names can refer to the names of Objective-C methods, which involve
the method name and the parameters, collectively called a *selector*, e.g.,
"``setWidth:height:``". Since all of these kinds of entities --- variables,
functions, Objective-C methods, C++ constructors, destructors, and operators
--- are represented as subclasses of Clang's common ``NamedDecl`` class,
``DeclarationName`` is designed to efficiently represent any kind of name.

Given a ``DeclarationName`` ``N``, ``N.getNameKind()`` will produce a value
that describes what kind of name ``N`` stores.
There are 10 options (all of
the names are inside the ``DeclarationName`` class).

``Identifier``

  The name is a simple identifier. Use ``N.getAsIdentifierInfo()`` to retrieve
  the corresponding ``IdentifierInfo*`` pointing to the actual identifier.

``ObjCZeroArgSelector``, ``ObjCOneArgSelector``, ``ObjCMultiArgSelector``

  The name is an Objective-C selector, which can be retrieved as a ``Selector``
  instance via ``N.getObjCSelector()``. The three possible name kinds for
  Objective-C reflect an optimization within the ``DeclarationName`` class:
  both zero- and one-argument selectors are stored as a masked
  ``IdentifierInfo`` pointer, and therefore require very little space, since
  zero- and one-argument selectors are far more common than multi-argument
  selectors (which use a different structure).

``CXXConstructorName``

  The name is a C++ constructor name. Use ``N.getCXXNameType()`` to retrieve
  the :ref:`type <QualType>` that this constructor is meant to construct. The
  type is always the canonical type, since all constructors for a given type
  have the same name.

``CXXDestructorName``

  The name is a C++ destructor name. Use ``N.getCXXNameType()`` to retrieve
  the :ref:`type <QualType>` whose destructor is being named. This type is
  always a canonical type.

``CXXConversionFunctionName``

  The name is a C++ conversion function. Conversion functions are named
  according to the type they convert to, e.g., "``operator void const *``".
  Use ``N.getCXXNameType()`` to retrieve the type that this conversion function
  converts to. This type is always a canonical type.

``CXXOperatorName``

  The name is a C++ overloaded operator name. Overloaded operators are named
  according to their spelling, e.g., "``operator+``" or "``operator new []``".
  Use ``N.getCXXOverloadedOperator()`` to retrieve the overloaded operator (a
  value of type ``OverloadedOperatorKind``).

``CXXLiteralOperatorName``

  The name is a C++11 user-defined literal operator. User-defined
  literal operators are named according to the suffix they define,
  e.g., "``_foo``" for "``operator "" _foo``". Use
  ``N.getCXXLiteralIdentifier()`` to retrieve the corresponding
  ``IdentifierInfo*`` pointing to the identifier.

``CXXUsingDirective``

  The name is a C++ using directive. Using directives are not really
  ``NamedDecl``\ s, in that they all have the same name, but they are
  implemented as such in order to store them in ``DeclContext``
  effectively.

``DeclarationName``\ s are cheap to create, copy, and compare. They require
only a single pointer's worth of storage in the common cases (identifiers,
zero- and one-argument Objective-C selectors) and use dense, uniqued storage
for the other kinds of names. Two ``DeclarationName``\ s can be compared for
equality (``==``, ``!=``) using a simple bitwise comparison, can be ordered
with ``<``, ``>``, ``<=``, and ``>=`` (which provide a lexicographical ordering
for normal identifiers but an unspecified ordering for other kinds of names),
and can be placed into LLVM ``DenseMap``\ s and ``DenseSet``\ s.

``DeclarationName`` instances can be created in different ways depending on
what kind of name the instance will store. Normal identifiers
(``IdentifierInfo`` pointers) and Objective-C selectors (``Selector``) can be
implicitly converted to ``DeclarationName``\ s. Names for C++ constructors,
destructors, conversion functions, and overloaded operators can be retrieved
from the ``DeclarationNameTable``, an instance of which is available as
``ASTContext::DeclarationNames``.
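
The "uniqued storage, pointer-sized handle, bitwise comparison" idea can be
illustrated with a toy interning table. This is only a sketch of the general
technique, not Clang's data structure: ``ToyName`` and ``ToyNameTable`` are
hypothetical, and the sketch assumes a handle holding one pointer is the size
of a pointer on the target platform.

.. code-block:: c++

  #include <string>
  #include <unordered_set>

  // Hypothetical toy handle: one pointer into uniqued storage.
  class ToyName {
    const std::string *Ptr = nullptr;
    friend class ToyNameTable;
    explicit ToyName(const std::string *P) : Ptr(P) {}

  public:
    ToyName() = default;
    const std::string &str() const { return *Ptr; }

    // Equality is a plain pointer comparison because spellings are uniqued.
    friend bool operator==(ToyName A, ToyName B) { return A.Ptr == B.Ptr; }
    friend bool operator!=(ToyName A, ToyName B) { return A.Ptr != B.Ptr; }
  };

  static_assert(sizeof(ToyName) == sizeof(void *),
                "the handle is expected to be a single pointer wide");

  // Hypothetical toy table playing the role of a name table: each distinct
  // spelling is stored exactly once.
  class ToyNameTable {
    // Elements of an unordered_set keep their addresses across rehashing.
    std::unordered_set<std::string> Storage;

  public:
    ToyName get(const std::string &Spelling) {
      return ToyName(&*Storage.insert(Spelling).first);
    }
  };

Requesting the same spelling twice from one table yields handles that compare
equal without any string comparison, which is the property that makes such
handles cheap to copy and to use as hash-map keys.
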
The ``DeclarationNameTable`` member functions
``getCXXConstructorName``, ``getCXXDestructorName``,
``getCXXConversionFunctionName``, and ``getCXXOperatorName``, respectively,
return ``DeclarationName`` instances for the four kinds of C++ special function
names.

.. _DeclContext:

Declaration contexts
--------------------

Every declaration in a program exists within some *declaration context*, such
as a translation unit, namespace, class, or function. Declaration contexts in
Clang are represented by the ``DeclContext`` class, from which the various
declaration-context AST nodes (``TranslationUnitDecl``, ``NamespaceDecl``,
``RecordDecl``, ``FunctionDecl``, etc.) derive. The ``DeclContext`` class
provides several facilities common to each declaration context:

Source-centric vs. Semantics-centric View of Declarations

  ``DeclContext`` provides two views of the declarations stored within a
  declaration context. The source-centric view accurately represents the
  program source code as written, including multiple declarations of entities
  where present (see the section :ref:`Redeclarations and Overloads
  <Redeclarations>`), while the semantics-centric view represents the program
  semantics. The two views are kept synchronized by semantic analysis while
  the ASTs are being constructed.

Storage of declarations within that context

  Every declaration context can contain some number of declarations. For
  example, a C++ class (represented by ``RecordDecl``) contains various member
  functions, fields, nested types, and so on. All of these declarations will
  be stored within the ``DeclContext``, and one can iterate over the
  declarations via [``DeclContext::decls_begin()``,
  ``DeclContext::decls_end()``). This mechanism provides the source-centric
  view of declarations in the context.

Lookup of declarations within that context

  The ``DeclContext`` structure provides efficient name lookup for names within
  that declaration context. For example, if ``N`` is a namespace we can look
  for the name ``N::f`` using ``DeclContext::lookup``. The lookup itself is
  based on a lazily-constructed array (for declaration contexts with a small
  number of declarations) or hash table (for declaration contexts with more
  declarations). The lookup operation provides the semantics-centric view of
  the declarations in the context.

Ownership of declarations

  The ``DeclContext`` owns all of the declarations that were declared within
  its declaration context, and is responsible for the management of their
  memory as well as their (de-)serialization.

All declarations are stored within a declaration context, and one can query
information about the context in which each declaration lives. One can
retrieve the ``DeclContext`` that contains a particular ``Decl`` using
``Decl::getDeclContext``. However, see the section
:ref:`LexicalAndSemanticContexts` for more information about how to interpret
this context information.

.. _Redeclarations:

Redeclarations and Overloads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Within a translation unit, it is common for an entity to be declared several
times. For example, we might declare a function "``f``" and then later
re-declare it as part of an inlined definition:

.. code-block:: c++

  void f(int x, int y, int z = 1);

  inline void f(int x, int y, int z) { /* ... */ }

The representation of "``f``" differs in the source-centric and
semantics-centric views of a declaration context.
In the source-centric view,
all redeclarations will be present, in the order they occurred in the source
code, making this view suitable for clients that wish to see the structure of
the source code. In the semantics-centric view, only the most recent "``f``"
will be found by the lookup, since it effectively replaces the first
declaration of "``f``".

(Note that because ``f`` can be redeclared at block scope, or in a friend
declaration, etc. it is possible that the declaration of ``f`` found by name
lookup will not be the most recent one.)

In the semantics-centric view, overloading of functions is represented
explicitly. For example, given two declarations of a function "``g``" that are
overloaded, e.g.,

.. code-block:: c++

  void g();
  void g(int);

the ``DeclContext::lookup`` operation will return a
``DeclContext::lookup_result`` that contains a range of iterators over
declarations of "``g``". Clients that perform semantic analysis on a program
and are not concerned with the actual source code will primarily use this
semantics-centric view.

.. _LexicalAndSemanticContexts:

Lexical and Semantic Contexts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each declaration has two potentially different declaration contexts: a
*lexical* context, which corresponds to the source-centric view of the
declaration context, and a *semantic* context, which corresponds to the
semantics-centric view. The lexical context is accessible via
``Decl::getLexicalDeclContext`` while the semantic context is accessible via
``Decl::getDeclContext``, both of which return ``DeclContext`` pointers. For
most declarations, the two contexts are identical. For example:

.. code-block:: c++

  class X {
  public:
    void f(int x);
  };

Here, the semantic and lexical contexts of ``X::f`` are the ``DeclContext``
associated with the class ``X`` (itself stored as a ``RecordDecl`` AST node).
However, we can now define ``X::f`` out-of-line:

.. code-block:: c++

  void X::f(int x = 17) { /* ... */ }

This definition of "``f``" has different lexical and semantic contexts. The
lexical context corresponds to the declaration context in which the actual
declaration occurred in the source code, e.g., the translation unit containing
``X``. Thus, this declaration of ``X::f`` can be found by traversing the
declarations provided by [``decls_begin()``, ``decls_end()``) in the
translation unit.

The semantic context of ``X::f`` corresponds to the class ``X``, since this
member function is (semantically) a member of ``X``. Lookup of the name ``f``
into the ``DeclContext`` associated with ``X`` will then return the definition
of ``X::f`` (including information about the default argument).

Transparent Declaration Contexts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In C and C++, there are several contexts in which names that are logically
declared inside another declaration will actually "leak" out into the enclosing
scope from the perspective of name lookup. The most obvious instance of this
behavior is in enumeration types, e.g.,

.. code-block:: c++

  enum Color {
    Red,
    Green,
    Blue
  };

Here, ``Color`` is an enumeration, which is a declaration context that contains
the enumerators ``Red``, ``Green``, and ``Blue``. Thus, traversing the list of
declarations contained in the enumeration ``Color`` will yield ``Red``,
``Green``, and ``Blue``. However, outside of the scope of ``Color`` one can
name the enumerator ``Red`` without qualifying the name, e.g.,

.. code-block:: c++

  Color c = Red;

There are other entities in C++ that provide similar behavior. For example,
linkage specifications that use curly braces:

.. code-block:: c++

  extern "C" {
    void f(int);
    void g(int);
  }
  // f and g are visible here

For source-level accuracy, we treat the linkage specification and the
enumeration type each as a declaration context in which the enclosed
declarations ("``Red``", "``Green``", and "``Blue``"; "``f``" and "``g``") are
declared. However, these declarations are visible outside of the scope of the
declaration context.

These language features (and several others, described below) have roughly the
same set of requirements: declarations are declared within a particular lexical
context, but the declarations are also found via name lookup in scopes
enclosing the declaration itself. This feature is implemented via
*transparent* declaration contexts (see
``DeclContext::isTransparentContext()``), whose declarations are visible in the
nearest enclosing non-transparent declaration context. This means that the
lexical context of the declaration (e.g., an enumerator) will be the
transparent ``DeclContext`` itself, as will the semantic context, but the
declaration will be visible in every outer context up to and including the
first non-transparent declaration context (since transparent declaration
contexts can be nested).

The transparent ``DeclContext``\ s are:

* Enumerations (but not C++11 "scoped enumerations"):

  .. code-block:: c++

    enum Color {
      Red,
      Green,
      Blue
    };
    // Red, Green, and Blue are in scope

* C++ linkage specifications:

  .. code-block:: c++

    extern "C" {
      void f(int);
      void g(int);
    }
    // f and g are in scope

* Anonymous unions and structs:

  .. code-block:: c++

    struct LookupTable {
      bool IsVector;
      union {
        std::vector<Item> *Vector;
        std::set<Item> *Set;
      };
    };

    LookupTable LT;
    LT.Vector = 0; // Okay: finds Vector inside the unnamed union

* C++11 inline namespaces:

  .. code-block:: c++

    namespace mylib {
      inline namespace debug {
        class X;
      }
    }
    mylib::X *xp; // okay: mylib::X refers to mylib::debug::X

.. _MultiDeclContext:

Multiply-Defined Declaration Contexts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

C++ namespaces have the interesting property that
the namespace can be defined multiple times, and the declarations provided by
each namespace definition are effectively merged (from the semantic point of
view). For example, the following two code snippets are semantically
indistinguishable:

.. code-block:: c++

  // Snippet #1:
  namespace N {
    void f();
  }
  namespace N {
    void f(int);
  }

  // Snippet #2:
  namespace N {
    void f();
    void f(int);
  }

In Clang's representation, the source-centric view of declaration contexts will
actually have two separate ``NamespaceDecl`` nodes in Snippet #1, each of which
is a declaration context that contains a single declaration of "``f``".
However, the semantics-centric view provided by name lookup into the namespace
``N`` for "``f``" will return a ``DeclContext::lookup_result`` that contains a
range of iterators over declarations of "``f``".

``DeclContext`` manages multiply-defined declaration contexts internally. The
function ``DeclContext::getPrimaryContext`` retrieves the "primary" context for
a given ``DeclContext`` instance, which is the ``DeclContext`` responsible for
maintaining the lookup table used for the semantics-centric view.
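
The division of labour between per-definition storage and a shared lookup table
on the primary context can be sketched as follows. This is a toy model under
simplifying assumptions, not Clang's ``DeclContext``: ``ToyDecl`` and
``ToyNamespace`` are hypothetical, and the lookup table is an eagerly filled
``std::map`` rather than the lazily built structure described earlier.

.. code-block:: c++

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical toy declaration.
  struct ToyDecl {
    std::string Name;
    std::string Signature;
  };

  // Hypothetical toy model of one "namespace N { ... }" block.
  class ToyNamespace {
    ToyNamespace *Primary;               // first definition of this namespace
    std::vector<const ToyDecl *> Decls;  // source-centric: this block only
    // Semantics-centric table; only the primary context's copy is used.
    std::map<std::string, std::vector<const ToyDecl *>> Lookup;

  public:
    // Passing nullptr creates the first (primary) definition.
    explicit ToyNamespace(ToyNamespace *Previous = nullptr)
        : Primary(Previous ? Previous->Primary : this) {}
    ToyNamespace(const ToyNamespace &) = delete;
    ToyNamespace &operator=(const ToyNamespace &) = delete;

    ToyNamespace *getPrimaryContext() { return Primary; }

    void addDecl(const ToyDecl *D) {
      Decls.push_back(D);                     // remember where it was written
      Primary->Lookup[D->Name].push_back(D);  // merge into the shared table
    }

    // Source-centric view: declarations written in this block, in order.
    const std::vector<const ToyDecl *> &decls() const { return Decls; }

    // Semantics-centric view: every declaration of Name in any block.
    std::vector<const ToyDecl *> lookup(const std::string &Name) const {
      auto It = Primary->Lookup.find(Name);
      if (It == Primary->Lookup.end())
        return {};
      return It->second;
    }
  };

Modelling Snippet #1 with two ``ToyNamespace`` objects, each block lists one
declaration of ``f`` when iterated, while a lookup of ``f`` through either
block returns both overloads.
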
Given a
``DeclContext``, one can obtain the set of declaration contexts that are
semantically connected to this declaration context, in source order, including
this context (which will be the only result, for non-namespace contexts) via
``DeclContext::collectAllContexts``. Note that these functions are used
internally within the lookup and insertion methods of the ``DeclContext``, so
the vast majority of clients can ignore them.

Because the same entity can be defined multiple times in different modules,
it is also possible for there to be multiple definitions of (for instance)
a ``CXXRecordDecl``, all of which describe a definition of the same class.
In such a case, only one of those "definitions" is considered by Clang to be
the definition of the class, and the others are treated as non-defining
declarations that happen to also contain member declarations. Corresponding
members in each definition of such multiply-defined classes are identified
either by redeclaration chains (if the members are ``Redeclarable``)
or by simply a pointer to the canonical declaration (if the declarations
are not ``Redeclarable`` -- in that case, a ``Mergeable`` base class is used
instead).

Error Handling
--------------

Clang produces an AST even when the code contains errors. Clang won't generate
and optimize code for it, but it's used as parsing continues to detect further
errors in the input. Clang-based tools also depend on such ASTs, and IDEs in
particular benefit from a high-quality AST for broken code.

In the presence of errors, Clang uses a few error-recovery strategies to
present the broken code in the AST:

- correcting errors: in cases where Clang is confident about the fix, it
  provides a FixIt attached to the error diagnostic and emits a corrected AST
  (reflecting the written code with FixIts applied). The advantage of this is
  more accurate subsequent diagnostics. Typo correction is a typical
  example.
- representing invalid nodes: the invalid node is preserved in the AST in some
  form, e.g. when the "declaration" part of the declaration contains semantic
  errors, the ``Decl`` node is marked as invalid.
- dropping invalid nodes: this often happens for errors for which we don't have
  graceful recovery. Prior to Recovery AST, a mismatched-argument function call
  expression was dropped even though a ``CallExpr`` was created for semantic
  analysis.

With these strategies, Clang surfaces better diagnostics, and provides AST
consumers a rich AST reflecting the written source code as much as possible even
for broken code.

Recovery AST
^^^^^^^^^^^^

The idea of Recovery AST is to use recovery nodes which act as placeholders to
maintain the rough structure of the parse tree, preserving locations and
children but with no language semantics attached to them.

For example, consider the following mismatched function call:

.. code-block:: c++

  int NoArg();
  void test(int abc) {
    NoArg(abc); // oops, mismatched function arguments.
  }

Without Recovery AST, the invalid function call expression (and its child
expressions) would be dropped in the AST:

::

  |-FunctionDecl <line:1:1, col:11> NoArg 'int ()'
  `-FunctionDecl <line:2:1, line:4:1> test 'void (int)'
    |-ParmVarDecl <col:11, col:15> col:15 used abc 'int'
    `-CompoundStmt <col:20, line:4:1>


With Recovery AST, the AST looks like:

::

  |-FunctionDecl <line:1:1, col:11> NoArg 'int ()'
  `-FunctionDecl <line:2:1, line:4:1> test 'void (int)'
    |-ParmVarDecl <col:11, col:15> used abc 'int'
    `-CompoundStmt <col:20, line:4:1>
      `-RecoveryExpr <line:3:3, col:12> 'int' contains-errors
        |-UnresolvedLookupExpr <col:3> '<overloaded function type>' lvalue (ADL) = 'NoArg'
        `-DeclRefExpr <col:9> 'int' lvalue ParmVar 'abc' 'int'


An alternative is to use existing ``Expr``\ s, e.g. ``CallExpr`` for the above
example. This would capture more call details (e.g. locations of parentheses)
and allow it to be treated uniformly with valid ``CallExpr``\ s. However,
jamming the data we have into ``CallExpr`` forces us to weaken its invariants,
e.g. the argument count may be wrong. This would introduce a huge burden on
consumers of the AST to handle such "impossible" cases. So when we're
representing (rather than correcting) errors, we use a distinct recovery node
type with extremely weak invariants instead.

``RecoveryExpr`` is the only recovery node so far. In practice, broken decls
need more detailed semantics preserved (the current ``Invalid`` flag works
fairly well), and completely broken statements with interesting internal
structure are rare (so dropping the statements is OK).

Types and dependence
^^^^^^^^^^^^^^^^^^^^

``RecoveryExpr`` is an ``Expr``, so it must have a type. In many cases the true
type can't really be known until the code is corrected (e.g. a call to a
function that doesn't exist).
This means that we can't properly perform type
checks on some containing constructs, such as ``return 42 + unknownFunction()``.

To model this, we generalize the concept of dependence from C++ templates to
mean dependence on a template parameter or how an error is repaired. The
``RecoveryExpr`` ``unknownFunction()`` has the totally unknown type
``DependentTy``, and this suppresses type-based analysis in the same way it
would inside a template.

In cases where we are confident about the concrete type (e.g. the return type
for a broken non-overloaded function call), the ``RecoveryExpr`` will have this
type. This allows more code to be typechecked, and produces a better AST and
more diagnostics. For example:

.. code-block:: C++

  unknownFunction().size() // .size() is a CXXDependentScopeMemberExpr
  std::string(42).size() // .size() is a resolved MemberExpr

Whether or not the ``RecoveryExpr`` has a dependent type, it is always
considered value-dependent, because its value isn't well-defined until the error
is resolved. Among other things, this means that Clang doesn't emit more errors
where a ``RecoveryExpr`` is used as a constant (e.g. array size), but also won't
try to evaluate it.

ContainsErrors bit
^^^^^^^^^^^^^^^^^^

Beyond the template dependence bits, we add a new "ContainsErrors" bit that
answers the question "does this expression or anything within it contain
errors?". This bit is always set for ``RecoveryExpr``, and propagated to other
related nodes. This provides a fast way to query whether any (recursive) child
of an expression had an error, which is often used to improve diagnostics.

.. code-block:: C++

  // C++
  void recoveryExpr(int abc) {
   unknownFunction(); // type-dependent, value-dependent, contains-errors

   std::string(42).size(); // value-dependent, contains-errors,
                           // not type-dependent, as we know the type is std::string
  }


.. code-block:: C

  // C
  void recoveryExpr(int abc) {
    unknownVar + abc; // type-dependent, value-dependent, contains-errors
  }


The ASTImporter
---------------

The ``ASTImporter`` class imports nodes of an ``ASTContext`` into another
``ASTContext``. Please refer to the document :doc:`ASTImporter: Merging Clang
ASTs <LibASTImporter>` for an introduction. Please also read through the
high-level `description of the import algorithm
<LibASTImporter.html#algorithm-of-the-import>`_; this is essential for
understanding further implementation details of the importer.

.. _templated:

Abstract Syntax Graph
^^^^^^^^^^^^^^^^^^^^^

Despite the name, the Clang AST is not a tree. It is a directed graph with
cycles. One example of a cycle is the connection between a
``ClassTemplateDecl`` and its "templated" ``CXXRecordDecl``. The *templated*
``CXXRecordDecl`` represents all the fields and methods inside the class
template, while the ``ClassTemplateDecl`` holds the information which is
related to being a template, i.e. template arguments, etc. We can get the
*templated* class (the ``CXXRecordDecl``) of a ``ClassTemplateDecl`` with
``ClassTemplateDecl::getTemplatedDecl()``. And we can get back a pointer to the
"described" class template from the *templated* class:
``CXXRecordDecl::getDescribedTemplate()``. So, this is a cycle between two
nodes: between the *templated* and the *described* node. There may be various
other kinds of cycles in the AST, especially in the case of declarations.

.. _structural-eq:

Structural Equivalency
^^^^^^^^^^^^^^^^^^^^^^

Importing one AST node copies that node into the destination ``ASTContext``. To
copy one node means that we create a new node in the "to" context and then set
its properties to be equal to the properties of the source node. Before the
copy, we make sure that the source node is not *structurally equivalent* to any
existing node in the destination context. If it happens to be equivalent then
we skip the copy.

The informal definition of structural equivalency is the following:
Two nodes are **structurally equivalent** if they are

- builtin types and refer to the same type, e.g. ``int`` and ``int`` are
  structurally equivalent,
- function types and all their parameters have structurally equivalent types,
- record types and all their fields in order of their definition have the same
  identifier names and structurally equivalent types,
- variable or function declarations and they have the same identifier name and
  their types are structurally equivalent.

In C, two types are structurally equivalent if they are *compatible types*. For
a formal definition of *compatible types*, please refer to 6.2.7/1 in the C11
standard. However, there is no definition for *compatible types* in the C++
standard. Still, we extend the definition of structural equivalency to
templates and their instantiations similarly: besides checking the previously
mentioned properties, we have to check for equivalent template
parameters/arguments, etc.

The structural equivalence check can be and is used independently from the
``ASTImporter``, e.g. the ``clang::Sema`` class also uses it.

The equivalence of nodes may depend on the equivalency of other pairs of nodes.
Thus, the check is implemented as a parallel graph traversal. We traverse
through the nodes of both graphs at the same time.
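
A parallel traversal of two graphs that tolerates cycles can be sketched as
below. This is a simplified illustration of the general technique rather than
the code Clang uses: ``ToyNode`` and ``structurallyEquivalent`` are
hypothetical, every node is reduced to a kind, a name, and an ordered child
list, and visited pairs are kept in a plain vector for brevity.

.. code-block:: c++

  #include <algorithm>
  #include <cstddef>
  #include <queue>
  #include <string>
  #include <utility>
  #include <vector>

  // Hypothetical toy node; children may point back to ancestors (cycles).
  struct ToyNode {
    std::string Kind;  // e.g. "record", "field"
    std::string Name;
    std::vector<const ToyNode *> Children;
  };

  using NodePair = std::pair<const ToyNode *, const ToyNode *>;

  // Breadth-first walk over pairs of nodes. A pair that was already queued
  // is tentatively assumed equivalent, so cycles terminate; any mismatch
  // found anywhere makes the overall answer false.
  bool structurallyEquivalent(const ToyNode *A, const ToyNode *B) {
    std::vector<NodePair> Visited{{A, B}};
    std::queue<NodePair> Work;
    Work.push({A, B});

    while (!Work.empty()) {
      const ToyNode *X = Work.front().first;
      const ToyNode *Y = Work.front().second;
      Work.pop();

      if (X->Kind != Y->Kind || X->Name != Y->Name ||
          X->Children.size() != Y->Children.size())
        return false;

      for (std::size_t I = 0; I != X->Children.size(); ++I) {
        NodePair Next{X->Children[I], Y->Children[I]};
        if (std::find(Visited.begin(), Visited.end(), Next) == Visited.end()) {
          Visited.push_back(Next);
          Work.push(Next);
        }
      }
    }
    return true;
  }

For two self-referential records (a ``List`` record whose ``next`` field leads
back to ``List``), the walk visits the pair of records, then the pair of
fields, sees that the next pair was already queued, and stops with a positive
answer instead of looping forever.
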
The actual implementation is
similar to breadth-first-search. Let's say we start the traversal with the <A,B>
pair of nodes. Whenever the traversal reaches a pair <X,Y> then the following
statements are true:

- A and X are nodes from the same ASTContext.
- B and Y are nodes from the same ASTContext.
- A and B may or may not be from the same ASTContext.
- if A == X and B == Y (pointer equivalency) then (there is a cycle during the
  traversal)

  - A and B are structurally equivalent if and only if

    - All dependent nodes on the path from <A,B> to <X,Y> are structurally
      equivalent.

When we compare two classes or enums and one of them is incomplete or has
unloaded external lexical declarations then we cannot descend to compare their
contained declarations. So in these cases they are considered equal if they
have the same names. This is how we compare forward declarations with
definitions.

.. TODO Should we elaborate the actual implementation of the graph traversal,
.. which is a very weird BFS traversal?

Redeclaration Chains
^^^^^^^^^^^^^^^^^^^^

The early version of the ``ASTImporter``'s merge mechanism squashed the
declarations, i.e. it aimed to have only one declaration instead of maintaining
a whole redeclaration chain. This early approach simply skipped importing a
function prototype, but it imported a definition. To demonstrate the problem
with this approach let's consider an empty "to" context and the following
``virtual`` function declarations of ``f`` in the "from" context:

.. code-block:: c++

  struct B { virtual void f(); };
  void B::f() {} // <-- let's import this definition

If we imported the definition with the "squashing" approach then we would
end up having one declaration which is indeed a definition, but ``isVirtual()``
returns ``false`` for it.
The reason is that the definition is indeed not
virtual, it is the property of the prototype!

Consequently, we must either set the virtual flag for the definition (but then
we create a malformed AST which the parser would never create), or we import
the whole redeclaration chain of the function. The most recent version of the
``ASTImporter`` uses the latter mechanism. We do import all function
declarations - regardless of whether they are definitions or prototypes - in
the order in which they appear in the "from" context.

.. One definition

If we have an existing definition in the "to" context, then we cannot import
another definition, we will use the existing definition. However, we can import
prototype(s): we chain the newly imported prototype(s) to the existing
definition. Whenever we import a new prototype from a third context, that will
be added to the end of the redeclaration chain. This may result in long
redeclaration chains in certain cases, e.g. if we import from several
translation units which include the same header with the prototype.

.. Squashing prototypes

To mitigate the problem of long redeclaration chains of free functions, we
could compare prototypes to see if they have the same properties and if yes
then we could merge these prototypes. The implementation of squashing of
prototypes for free functions is future work.

.. Exception: Cannot have more than 1 prototype in-class

Chaining functions this way ensures that we do copy all information from the
source AST. Nonetheless, there is a problem with member functions: While we can
have many prototypes for free functions, we must have only one prototype for a
member function.

.. code-block:: c++

  void f(); // OK
  void f(); // OK

  struct X {
    void f(); // OK
    void f(); // ERROR
  };
  void X::f() {} // OK

Thus, prototypes of member functions must be squashed, we cannot just simply
attach a new prototype to the existing in-class prototype. Consider the
following contexts:

.. code-block:: c++

  // "to" context
  struct X {
    void f(); // D0
  };

.. code-block:: c++

  // "from" context
  struct X {
    void f(); // D1
  };
  void X::f() {} // D2

When we import the prototype and the definition of ``f`` from the "from"
context, then the resulting redecl chain will look like this ``D0 -> D2'``,
where ``D2'`` is the copy of ``D2`` in the "to" context.

.. Redecl chains of other declarations

Generally speaking, when we import declarations (like enums and classes) we do
attach the newly imported declaration to the existing redeclaration chain (if
there is structural equivalency). We do not import, however, the whole
redeclaration chain as we do in case of functions. Up till now, we haven't
found any essential property of forward declarations which is similar to the
case of the virtual flag in a member function prototype. In the future, this
may change, though.

Traversal during the Import
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The node specific import mechanisms are implemented in
``ASTNodeImporter::VisitNode()`` functions, e.g. ``VisitFunctionDecl()``.
When we import a declaration then first we import everything which is needed to
call the constructor of that declaration node. Everything which can be set
later is set after the node is created. For example, in case of a
``FunctionDecl`` we first import the declaration context in which the function
is declared, then we create the ``FunctionDecl`` and only then we import the
body of the function.
This means there are implicit dependencies between AST
nodes. These dependencies determine the order in which we visit nodes in the
"from" context. As with the regular graph traversal algorithms like DFS, we
keep track of which nodes we have already visited in
``ASTImporter::ImportedDecls``. Whenever we create a node then we immediately
add that to the ``ImportedDecls``. We must not start the import of any other
declarations before we keep track of the newly created one. This is essential,
otherwise, we would not be able to handle circular dependencies. To enforce
this, we wrap all constructor calls of all AST nodes in
``GetImportedOrCreateDecl()``. This wrapper ensures that all newly created
declarations are immediately marked as imported; also, if a declaration is
already marked as imported then we just return its counterpart in the "to"
context. Consequently, calling a declaration's ``::Create()`` function directly
would lead to errors, please don't do that!

Even with the use of ``GetImportedOrCreateDecl()`` there is still a
possibility of having an infinite import recursion if things are imported from
each other in the wrong way. Imagine that during the import of ``A``, the
import of ``B`` is requested before we could create the node for ``A`` (the
constructor needs a reference to ``B``). And the same could be true for the
import of ``B`` (``A`` is requested to be imported before we could create the
node for ``B``). In case of the :ref:`templated-described swing <templated>` we
take extra care to break the cyclical dependency: we import and set the
described template only after the ``CXXRecordDecl`` is created. As a best
practice, before creating the node in the "to" context, avoid importing
other nodes which are not needed for the constructor of node ``A``.
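
The "register the new node before importing anything else" rule can be
illustrated with a minimal, self-contained sketch. This is not Clang code:
``ToyNode`` and ``ToyImporter`` are hypothetical names invented for this
example, and the ``Imported`` map only plays the role that ``ImportedDecls``
plays in the real importer.

.. code-block:: c++

  #include <cassert>
  #include <cstddef>
  #include <memory>
  #include <string>
  #include <unordered_map>
  #include <vector>

  // Toy stand-in for an AST node (hypothetical, not a Clang class).
  struct ToyNode {
    std::string Name;
    std::vector<ToyNode *> Deps; // dependencies; may form cycles
  };

  // Toy importer that copies a graph of ToyNodes into its own storage.
  class ToyImporter {
    // Maps "from" nodes to their "to" counterparts.
    std::unordered_map<const ToyNode *, ToyNode *> Imported;
    // Owns the nodes of the "to" context.
    std::vector<std::unique_ptr<ToyNode>> Storage;

  public:
    ToyNode *import(const ToyNode *From) {
      // Already imported (or currently being imported): return the counterpart.
      auto It = Imported.find(From);
      if (It != Imported.end())
        return It->second;

      // Create the node and register it *before* importing anything else.
      // Without this, a cycle such as A -> B -> A would recurse forever.
      Storage.push_back(std::make_unique<ToyNode>());
      ToyNode *To = Storage.back().get();
      To->Name = From->Name;
      Imported[From] = To;

      // Only now import the properties that can be set after creation.
      for (const ToyNode *Dep : From->Deps)
        To->Deps.push_back(import(Dep));
      return To;
    }

    std::size_t size() const { return Storage.size(); }
  };

Importing a two-node cycle with this sketch creates exactly two nodes in the
"to" storage, and the copied nodes point at each other just like the originals.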

Error Handling
^^^^^^^^^^^^^^

Every import function returns either an ``llvm::Error`` or an
``llvm::Expected<T>`` object. This forces callers to check the return value of
the import functions. If there was an error during one import then we return
with that error. (Exception: when we import the members of a class, we collect
the individual errors with each member and we concatenate them in one Error
object.) We cache these errors in cases of declarations. During the next import
call if there is an existing error we just return with that. So, clients of the
library receive an Error object, which they must check.

During import of a specific declaration, it may happen that some AST nodes had
already been created before we recognize an error. In this case, we signal back
the error to the caller, but the "to" context remains polluted with those nodes
which had been created. Ideally, those nodes should not have been created, but
at that time we did not know about the error; the error happened later. Since
the AST is immutable (in most cases we can't remove existing nodes) we choose
to mark these nodes as erroneous.

We cache the errors associated with declarations in the "from" context in
``ASTImporter::ImportDeclErrors`` and the ones which are associated with the
"to" context in ``ASTImporterSharedState::ImportErrors``. Note that there may
be several ASTImporter objects which import into the same "to" context but from
different "from" contexts; in this case, they have to share the associated
errors of the "to" context.

When an error happens, it propagates through the call stack, through all the
dependent nodes. However, in case of dependency cycles, this is not enough,
because we strive to mark the erroneous nodes so clients can act upon them.
In those
cases, we have to keep track of the errors for those nodes which are
intermediate nodes of a cycle.

An **import path** is the list of the AST nodes which we visit during an Import
call. If node ``A`` depends on node ``B`` then the path contains an ``A->B``
edge. From the call stack of the import functions, we can read the very same
path.

Now imagine the following AST, where the ``->`` represents dependency in terms
of the import (all nodes are declarations).

.. code-block:: text

  A->B->C->D
     `->E

We would like to import A.
The import behaves like a DFS, so we will visit the nodes in this order: ABCDE.
During the visitation we will have the following import paths:

.. code-block:: text

  A
  AB
  ABC
  ABCD
  ABC
  AB
  ABE
  AB
  A

If during the visit of E there is an error then we set an error for E, then as
the call stack shrinks for B, then for A:

.. code-block:: text

  A
  AB
  ABC
  ABCD
  ABC
  AB
  ABE // Error! Set an error to E
  AB  // Set an error to B
  A   // Set an error to A

However, during the import we could import C and D without any error and they
are independent of A, B and E. We must not set up an error for C and D. So, at
the end of the import we have an entry in ``ImportDeclErrors`` for A, B, E but
not for C, D.

Now, what happens if there is a cycle in the import path? Let's consider this
AST:

.. code-block:: text

  A->B->C->A
     `->E

During the visitation, we will have the below import paths and if during the
visit of E there is an error then we will set up an error for E, B, A. But what
about C?

.. code-block:: text

  A
  AB
  ABC
  ABCA
  ABC
  AB
  ABE // Error! Set an error to E
  AB  // Set an error to B
  A   // Set an error to A

This time we know that both B and C are dependent on A. This means we must set
up an error for C too. As the call stack unwinds we get back to A and we must
set up an error for all nodes which depend on A (this includes C). But C is no
longer on the import path, it was only there earlier. Such a situation can
happen only if during the visitation we had a cycle. If we didn't have any
cycle, then the normal way of passing an Error object through the call stack
could handle the situation. This is why we must track cycles during the import
process for each visited declaration.

Lookup Problems
^^^^^^^^^^^^^^^

When we import a declaration from the source context then we check whether we
already have a structurally equivalent node with the same name in the "to"
context. If the "from" node is a definition and the found one is also a
definition, then we do not create a new node, instead, we mark the found node
as the imported node. If the found definition and the one we want to import
have the same name but they are not structurally equivalent, then we have an
ODR violation in case of C++. If the "from" node is not a definition then we add
that to the redeclaration chain of the found node. This behaviour is essential
when we merge ASTs from different translation units which include the same
header file(s). For example, we want to have only one definition for the class
template ``std::vector``, even if we included ``<vector>`` in several
translation units.

To find a structurally equivalent node we can use the regular C/C++ lookup
functions: ``DeclContext::noload_lookup()`` and
``DeclContext::localUncachedLookup()``. These functions do respect the C/C++
name hiding rules, thus you cannot find certain declarations in a given
declaration context.
For instance, unnamed declarations (anonymous structs),
non-first ``friend`` declarations and template specializations are hidden. This
is a problem, because if we use the regular C/C++ lookup then we create
redundant AST nodes during the merge! Also, having two instances of the same
node could result in false :ref:`structural in-equivalencies <structural-eq>`
of other nodes which depend on the duplicated node. Because of these reasons,
we created a lookup class which has the sole purpose to register all
declarations, so later they can be looked up by subsequent import requests.
This is the ``ASTImporterLookupTable`` class. This lookup table should be
shared amongst the different ``ASTImporter`` instances if they happen to import
to the very same "to" context. This is why we can use the importer specific
lookup only via the ``ASTImporterSharedState`` class.

ExternalASTSource
~~~~~~~~~~~~~~~~~

The ``ExternalASTSource`` is an abstract interface associated with the
``ASTContext`` class. It provides the ability to read the declarations stored
within a declaration context either for iteration or for name lookup. A
declaration context with an external AST source may load its declarations
on-demand. This means that the list of declarations (represented as a linked
list, the head is ``DeclContext::FirstDecl``) could be empty. However, member
functions like ``DeclContext::lookup()`` may initiate a load.

Usually, external sources are associated with precompiled headers. For example,
when we load a class from a PCH then the members are loaded only if we do want
to look up something in the class' context.

In case of LLDB, an implementation of the ``ExternalASTSource`` interface is
attached to the AST context which is related to the parsed expression.
This
implementation of the ``ExternalASTSource`` interface is realized with the help
of the ``ASTImporter`` class. This way, LLDB can reuse Clang's parsing
machinery while synthesizing the underlying AST from the debug data (e.g. from
DWARF). From the view of the ``ASTImporter`` this means both the "to" and the
"from" context may have declaration contexts with external lexical storage. If
a ``DeclContext`` in the "to" AST context has external lexical storage then we
must take extra care to work only with the already loaded declarations!
Otherwise, we would end up with an uncontrolled import process. For instance,
if we used the regular ``DeclContext::lookup()`` to find the existing
declarations in the "to" context then the ``lookup()`` call itself would
initiate a new import while we are in the middle of importing a declaration!
(By the time we initiate the lookup we haven't registered yet that we already
started to import the node of the "from" context.) This is why we use
``DeclContext::noload_lookup()`` instead.

Class Template Instantiations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Different translation units may have class template instantiations with the
same template arguments, but with a different set of instantiated
``MethodDecls`` and ``FieldDecls``. Consider the following files:

.. code-block:: c++

  // x.h
  template <typename T>
  struct X {
      int a{0}; // FieldDecl with InitListExpr
      X(char) : a(3) {}     // (1)
      X(int) {}             // (2)
  };

  // foo.cpp
  void foo() {
      // ClassTemplateSpec with ctor (1): FieldDecl without InitListExpr
      X<char> xc('c');
  }

  // bar.cpp
  void bar() {
      // ClassTemplateSpec with ctor (2): FieldDecl WITH InitListExpr
      X<char> xc(1);
  }

In ``foo.cpp`` we use the constructor with number ``(1)``, which explicitly
initializes the member ``a`` to ``3``, thus the ``InitListExpr`` ``{0}`` is not
used here and the AST node is not instantiated. However, in the case of
``bar.cpp`` we use the constructor with number ``(2)``, which does not
explicitly initialize the ``a`` member, so the default ``InitListExpr`` is
needed and thus instantiated. When we merge the AST of ``foo.cpp`` and
``bar.cpp`` we must create an AST node for the class template instantiation of
``X<char>`` which has all the required nodes. Therefore, when we find an
existing ``ClassTemplateSpecializationDecl`` then we merge the fields of the
``ClassTemplateSpecializationDecl`` in the "from" context in a way that the
``InitListExpr`` is copied if not existent yet. The same merge mechanism should
be done in the cases of instantiated default arguments and exception
specifications of functions.

.. _visibility:

Visibility of Declarations
^^^^^^^^^^^^^^^^^^^^^^^^^^

During import of a global variable with external visibility, the lookup will
find variables (with the same name) but with static visibility (linkage).
Clearly, we cannot put them into the same redeclaration chain. The same is true
in the case of functions. Also, we have to take care of other kinds of
declarations like enums, classes, etc. if they are in anonymous namespaces.
Therefore, we filter the lookup results and consider only those which have the
same visibility as the declaration we currently import.

We consider two declarations in two anonymous namespaces to have the same
visibility only if they are imported from the same AST context.

Strategies to Handle Conflicting Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

During the import we look up existing declarations with the same name. We filter
the lookup results based on their :ref:`visibility <visibility>`. If any of the
found declarations are not structurally equivalent then we have run into a name
conflict error (ODR violation in C++). In this case, we return with an
``Error`` and we set up the ``Error`` object for the declaration. However, some
clients of the ``ASTImporter`` may require a different, perhaps less
conservative and more liberal error handling strategy.

E.g. static analysis clients may benefit if the node is created even if there
is a name conflict. During the CTU analysis of certain projects, we recognized
that there are global declarations which collide with declarations from other
translation units, but they are not referenced outside of their translation
unit. These declarations should ideally be in an unnamed namespace. If we treat
these collisions liberally then CTU analysis can find more results. Note that
the feature to choose between name conflict handling strategies is still
work in progress.

.. _CFG:

The ``CFG`` class
-----------------

The ``CFG`` class is designed to represent a source-level control-flow graph
for a single statement (``Stmt*``). Typically instances of ``CFG`` are
constructed for function bodies (usually an instance of ``CompoundStmt``), but
can also be instantiated to represent the control-flow of any class that
subclasses ``Stmt``, which includes simple expressions.
Control-flow graphs
are especially useful for performing `flow- or path-sensitive
<https://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities>`_ program
analyses on a given function.

Basic Blocks
^^^^^^^^^^^^

Concretely, an instance of ``CFG`` is a collection of basic blocks. Each basic
block is an instance of ``CFGBlock``, which simply contains an ordered sequence
of ``Stmt*`` (each referring to statements in the AST). The ordering of
statements within a block indicates unconditional flow of control from one
statement to the next. :ref:`Conditional control-flow
<ConditionalControlFlow>` is represented using edges between basic blocks. The
statements within a given ``CFGBlock`` can be traversed using the
``CFGBlock::*iterator`` interface.

A ``CFG`` object owns the instances of ``CFGBlock`` within the control-flow
graph it represents. Each ``CFGBlock`` within a CFG is also uniquely numbered
(accessible via ``CFGBlock::getBlockID()``). Currently the number is based on
the order in which the blocks were created, but no assumptions should be made
on how ``CFGBlocks`` are numbered other than their numbers are unique and that
they are numbered from 0..N-1 (where N is the number of basic blocks in the
CFG).

Entry and Exit Blocks
^^^^^^^^^^^^^^^^^^^^^

Each instance of ``CFG`` contains two special blocks: an *entry* block
(accessible via ``CFG::getEntry()``), which has no incoming edges, and an
*exit* block (accessible via ``CFG::getExit()``), which has no outgoing edges.
Neither block contains any statements, and they serve the role of providing a
clear entrance and exit for a body of code such as a function body. The
presence of these empty blocks greatly simplifies the implementation of many
analyses built on top of CFGs.

.. _ConditionalControlFlow:

Conditional Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^

Conditional control-flow (such as those induced by if-statements and loops) is
represented as edges between ``CFGBlocks``. Because different C language
constructs can induce control-flow, each ``CFGBlock`` also records an extra
``Stmt*`` that represents the *terminator* of the block. A terminator is
simply the statement that caused the control-flow, and is used to identify the
nature of the conditional control-flow between blocks. For example, in the
case of an if-statement, the terminator refers to the ``IfStmt`` object in the
AST that represented the given branch.

To illustrate, consider the following code example:

.. code-block:: c++

  int foo(int x) {
    x = x + 1;
    if (x > 2)
      x++;
    else {
      x += 2;
      x *= 2;
    }

    return x;
  }

After invoking the parser+semantic analyzer on this code fragment, the AST of
the body of ``foo`` is referenced by a single ``Stmt*``. We can then construct
an instance of ``CFG`` representing the control-flow graph of this function
body by a single call to a static class method:

.. code-block:: c++

  Stmt *FooBody = ...
  std::unique_ptr<CFG> FooCFG = CFG::buildCFG(FooBody);

Along with providing an interface to iterate over its ``CFGBlocks``, the
``CFG`` class also provides methods that are useful for debugging and
visualizing CFGs. For example, the method ``CFG::dump()`` dumps a
pretty-printed version of the CFG to standard error. This is especially useful
when one is using a debugger such as gdb. For example, here is the output of
``FooCFG->dump()``:

.. code-block:: text

  [ B5 (ENTRY) ]
     Predecessors (0):
     Successors (1): B4

  [ B4 ]
     1: x = x + 1
     2: (x > 2)
     T: if [B4.2]
     Predecessors (1): B5
     Successors (2): B3 B2

  [ B3 ]
     1: x++
     Predecessors (1): B4
     Successors (1): B1

  [ B2 ]
     1: x += 2
     2: x *= 2
     Predecessors (1): B4
     Successors (1): B1

  [ B1 ]
     1: return x;
     Predecessors (2): B2 B3
     Successors (1): B0

  [ B0 (EXIT) ]
     Predecessors (1): B1
     Successors (0):

For each block, the pretty-printed output displays the number of
*predecessor* blocks (blocks that have outgoing control-flow to the given
block) and *successor* blocks (blocks that have incoming control-flow from the
given block). We can also clearly see the special entry
and exit blocks at the beginning and end of the pretty-printed output. For the
entry block (block B5), the number of predecessor blocks is 0, while for the
exit block (block B0) the number of successor blocks is 0.

The most interesting block here is B4, whose outgoing control-flow represents
the branching caused by the sole if-statement in ``foo``. Of particular
interest is the second statement in the block, ``(x > 2)``, and the terminator,
printed as ``if [B4.2]``. The second statement represents the evaluation of
the condition of the if-statement, which occurs before the actual branching of
control-flow. Within the ``CFGBlock`` for B4, the ``Stmt*`` for the second
statement refers to the actual expression in the AST for ``(x > 2)``. Thus
pointers to subclasses of ``Expr`` can appear in the list of statements in a
block, and not just subclasses of ``Stmt`` that refer to proper C statements.

The terminator of block B4 is a pointer to the ``IfStmt`` object in the AST.
The pretty-printer outputs ``if [B4.2]`` because the condition expression of
the if-statement has an actual place in the basic block, and thus the
terminator is essentially *referring* to the expression that is the second
statement of block B4 (i.e., B4.2). In this manner, conditions for
control-flow (which also includes conditions for loops and switch statements)
are hoisted into the actual basic block.

.. Implicit Control-Flow
.. ^^^^^^^^^^^^^^^^^^^^^

.. A key design principle of the ``CFG`` class was to not require any
.. transformations to the AST in order to represent control-flow. Thus the
.. ``CFG`` does not perform any "lowering" of the statements in an AST: loops
.. are not transformed into guarded gotos, short-circuit operations are not
.. converted to a set of if-statements, and so on.

Constant Folding in the Clang AST
---------------------------------

There are several places where constants and constant folding matter a lot to
the Clang front-end. First, in general, we prefer the AST to retain the source
code as close to how the user wrote it as possible. This means that if they
wrote "``5+4``", we want to keep the addition and two constants in the AST, we
don't want to fold to "``9``". This means that constant folding in various
ways turns into a tree walk that needs to handle the various cases.

However, there are places in both C and C++ that require constants to be
folded. For example, the C standard defines what an "integer constant
expression" (i-c-e) is with very precise and specific requirements. The
language then requires i-c-e's in a lot of places (for example, the size of a
bitfield, the value for a case statement, etc). For these, we have to be able
to constant fold the constants, to do semantic checks (e.g., verify bitfield
size is non-negative and that case statements aren't duplicated). We aim for
We aim for 2670Clang to be very pedantic about this, diagnosing cases when the code does not 2671use an i-c-e where one is required, but accepting the code unless running with 2672``-pedantic-errors``. 2673 2674Things get a little bit more tricky when it comes to compatibility with 2675real-world source code. Specifically, GCC has historically accepted a huge 2676superset of expressions as i-c-e's, and a lot of real world code depends on 2677this unfortunate accident of history (including, e.g., the glibc system 2678headers). GCC accepts anything its "fold" optimizer is capable of reducing to 2679an integer constant, which means that the definition of what it accepts changes 2680as its optimizer does. One example is that GCC accepts things like "``case 2681X-X:``" even when ``X`` is a variable, because it can fold this to 0. 2682 2683Another issue are how constants interact with the extensions we support, such 2684as ``__builtin_constant_p``, ``__builtin_inf``, ``__extension__`` and many 2685others. C99 obviously does not specify the semantics of any of these 2686extensions, and the definition of i-c-e does not include them. However, these 2687extensions are often used in real code, and we have to have a way to reason 2688about them. 2689 2690Finally, this is not just a problem for semantic analysis. The code generator 2691and other clients have to be able to fold constants (e.g., to initialize global 2692variables) and have to handle a superset of what C99 allows. Further, these 2693clients can benefit from extended information. For example, we know that 2694"``foo() || 1``" always evaluates to ``true``, but we can't replace the 2695expression with ``true`` because it has side effects. 2696 2697Implementation Approach 2698^^^^^^^^^^^^^^^^^^^^^^^ 2699 2700After trying several different approaches, we've finally converged on a design 2701(Note, at the time of this writing, not all of this has been implemented, 2702consider this a design goal!). 
Our basic approach is to define a single
recursive evaluation method (``Expr::Evaluate``), which is implemented
in ``AST/ExprConstant.cpp``. Given an expression with "scalar" type (integer,
fp, complex, or pointer) this method returns the following information:

* Whether the expression is an integer constant expression, a general constant
  that was folded but has no side effects, a general constant that was folded
  but that does have side effects, or an uncomputable/unfoldable value.
* If the expression was computable in any way, this method returns the
  ``APValue`` for the result of the expression.
* If the expression is not evaluatable at all, this method returns information
  on one of the problems with the expression. This includes a
  ``SourceLocation`` for where the problem is, and a diagnostic ID that explains
  the problem. The diagnostic should have ``ERROR`` type.
* If the expression is not an integer constant expression, this method returns
  information on one of the problems with the expression. This includes a
  ``SourceLocation`` for where the problem is, and a diagnostic ID that
  explains the problem. The diagnostic should have ``EXTENSION`` type.

This information gives various clients the flexibility that they want, and we
will eventually have some helper methods for various extensions. For example,
``Sema`` should have a ``Sema::VerifyIntegerConstantExpression`` method, which
calls ``Evaluate`` on the expression. If the expression is not foldable, the
error is emitted, and it would return ``true``. If the expression is not an
i-c-e, the ``EXTENSION`` diagnostic is emitted. Finally it would return
``false`` to indicate that the AST is OK.

Other clients can use the information in other ways, for example, codegen can
just use expressions that are foldable in any way.
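
The four-way classification described above can be sketched on a toy
expression tree. The code below is only an illustration under simplifying
assumptions: ``ToyExpr``, ``FoldKind``, ``FoldResult`` and ``evaluate`` are
hypothetical names made up for this example, values are plain ``long``
instead of ``APValue``, and no diagnostics are produced. It is not the
interface implemented in ``AST/ExprConstant.cpp``.

.. code-block:: c++

  #include <algorithm>
  #include <cassert>
  #include <memory>
  #include <utility>

  // Ordered from "best" to "worst", so std::max picks the weaker guarantee.
  enum class FoldKind { ICE, Folded, FoldedWithSideEffects, NotFoldable };

  struct FoldResult {
    FoldKind Kind;
    long Value; // only meaningful if Kind != NotFoldable
  };

  // Toy expression node (hypothetical, not clang::Expr).
  struct ToyExpr {
    enum Op { IntLit, Var, Call, Add, LogicalOr } Opcode = IntLit;
    long Literal = 0;
    std::shared_ptr<ToyExpr> LHS, RHS;
  };
  using ExprPtr = std::shared_ptr<ToyExpr>;

  ExprPtr lit(long V) {
    auto E = std::make_shared<ToyExpr>();
    E->Literal = V;
    return E;
  }
  ExprPtr leaf(ToyExpr::Op O) { // a variable reference or a call
    auto E = std::make_shared<ToyExpr>();
    E->Opcode = O;
    return E;
  }
  ExprPtr binary(ToyExpr::Op O, ExprPtr L, ExprPtr R) {
    auto E = std::make_shared<ToyExpr>();
    E->Opcode = O;
    E->LHS = std::move(L);
    E->RHS = std::move(R);
    return E;
  }

  // In this toy model only calls have side effects.
  bool hasSideEffects(const ToyExpr &E) {
    if (E.Opcode == ToyExpr::Call)
      return true;
    return (E.LHS && hasSideEffects(*E.LHS)) ||
           (E.RHS && hasSideEffects(*E.RHS));
  }

  FoldResult evaluate(const ToyExpr &E) {
    const FoldResult Unfoldable{FoldKind::NotFoldable, 0};
    switch (E.Opcode) {
    case ToyExpr::IntLit:
      return {FoldKind::ICE, E.Literal};
    case ToyExpr::Var:
    case ToyExpr::Call:
      return Unfoldable;
    case ToyExpr::Add: {
      FoldResult L = evaluate(*E.LHS), R = evaluate(*E.RHS);
      if (L.Kind == FoldKind::NotFoldable || R.Kind == FoldKind::NotFoldable)
        return Unfoldable;
      return {std::max(L.Kind, R.Kind), L.Value + R.Value};
    }
    case ToyExpr::LogicalOr: {
      FoldResult L = evaluate(*E.LHS), R = evaluate(*E.RHS);
      if (L.Kind != FoldKind::NotFoldable) {
        if (L.Value != 0)
          return {L.Kind, 1}; // the right-hand side is never evaluated
        if (R.Kind == FoldKind::NotFoldable)
          return Unfoldable;
        return {std::max(L.Kind, R.Kind), R.Value != 0 ? 1 : 0};
      }
      // Unknown left-hand side, but a true right-hand side fixes the value.
      if (R.Kind != FoldKind::NotFoldable && R.Value != 0)
        return {hasSideEffects(*E.LHS) ? FoldKind::FoldedWithSideEffects
                                       : FoldKind::Folded,
                1};
      return Unfoldable;
    }
    }
    return Unfoldable;
  }

In this sketch ``5+4`` is classified as an integer constant expression with
value 9, while ``foo() || 1`` folds to 1 but is reported as having side
effects, so a client such as a code generator would know that it cannot simply
drop the call.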

Extensions
^^^^^^^^^^

This section describes how some of the various extensions Clang supports
interact with constant evaluation:

* ``__extension__``: The expression form of this extension causes any
  evaluatable subexpression to be accepted as an integer constant expression.
* ``__builtin_constant_p``: This returns true (as an integer constant
  expression) if the operand evaluates to either a numeric value (that is, not
  a pointer cast to integral type) of integral, enumeration, floating or
  complex type, or if it evaluates to the address of the first character of a
  string literal (possibly cast to some other type). As a special case, if
  ``__builtin_constant_p`` is the (potentially parenthesized) condition of a
  conditional operator expression ("``?:``"), only the true side of the
  conditional operator is considered, and it is evaluated with full constant
  folding.
* ``__builtin_choose_expr``: The condition is required to be an integer
  constant expression, but we accept any constant as an "extension of an
  extension". This only evaluates one operand depending on which way the
  condition evaluates.
* ``__builtin_classify_type``: This always returns an integer constant
  expression.
* ``__builtin_inf, nan, ...``: These are treated just like a floating-point
  literal.
* ``__builtin_abs, copysign, ...``: These are constant folded as general
  constant expressions.
* ``__builtin_strlen`` and ``strlen``: These are constant folded as integer
  constant expressions if the argument is a string literal.

.. _Sema:

The Sema Library
================

This library is called by the :ref:`Parser library <Parser>` during parsing to
do semantic analysis of the input. For valid programs, Sema builds an AST for
parsed constructs.

.. _CodeGen:

The CodeGen Library
===================

CodeGen takes an :ref:`AST <AST>` as input and produces `LLVM IR code
<https://llvm.org/docs/LangRef.html>`_ from it.

How to change Clang
===================

How to add an attribute
-----------------------
Attributes are a form of metadata that can be attached to a program construct,
allowing the programmer to pass semantic information along to the compiler for
various uses. For example, attributes may be used to alter the code generation
for a program construct, or to provide extra semantic information for static
analysis. This document explains how to add a custom attribute to Clang.
Documentation on existing attributes can be found `here
<https://clang.llvm.org/docs/AttributeReference.html>`_.

Attribute Basics
^^^^^^^^^^^^^^^^
Attributes in Clang are handled in three stages: parsing into a parsed attribute
representation, conversion from a parsed attribute into a semantic attribute,
and then the semantic handling of the attribute.

Parsing of the attribute is determined by the various syntactic forms attributes
can take, such as GNU, C++11, and Microsoft style attributes, as well as other
information provided by the table definition of the attribute. Ultimately, the
parsed representation of an attribute object is a ``ParsedAttr`` object.
These parsed attributes chain together as a list of parsed attributes attached
to a declarator or declaration specifier. The parsing of attributes is handled
automatically by Clang, except for attributes spelled as keywords. When
implementing a keyword attribute, the parsing of the keyword and creation of the
``ParsedAttr`` object must be done manually.

Eventually, ``Sema::ProcessDeclAttributeList()`` is called with a ``Decl`` and
a ``ParsedAttr``, at which point the parsed attribute can be transformed
into a semantic attribute.
The process by which a parsed attribute is converted
into a semantic attribute depends on the attribute definition and semantic
requirements of the attribute. The end result, however, is that the semantic
attribute object is attached to the ``Decl`` object, and can be obtained by a
call to ``Decl::getAttr<T>()``. Similarly, for statement attributes,
``Sema::ProcessStmtAttributes()`` is called with a ``Stmt`` and a list of
``ParsedAttr`` objects to be converted into a semantic attribute.

The structure of the semantic attribute is also governed by the attribute
definition given in ``Attr.td``. This definition is used to automatically
generate functionality used for the implementation of the attribute, such as a
class derived from ``clang::Attr``, information for the parser to use, automated
semantic checking for some attributes, etc.


``include/clang/Basic/Attr.td``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step to adding a new attribute to Clang is to add its definition to
`include/clang/Basic/Attr.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/Attr.td>`_.
This tablegen definition must derive from the ``Attr`` (tablegen, not
semantic) type, or one of its derivatives. Most attributes will derive from the
``InheritableAttr`` type, which specifies that the attribute can be inherited by
later redeclarations of the ``Decl`` it is associated with.
``InheritableParamAttr`` is similar to ``InheritableAttr``, except that the
attribute is written on a parameter instead of a declaration. If the attribute
applies to statements, it should inherit from ``StmtAttr``. If the attribute is
intended to apply to a type instead of a declaration, such an attribute should
derive from ``TypeAttr``, and will generally not be given an AST representation.
(Note that this document does not cover the creation of type attributes.)
An attribute that inherits from ``IgnoredAttr`` is parsed, but will generate an
ignored attribute diagnostic when used, which may be useful when an attribute is
supported by another vendor but not supported by Clang.

The definition will specify several key pieces of information, such as the
semantic name of the attribute, the spellings the attribute supports, the
arguments the attribute expects, and more. Most members of the ``Attr`` tablegen
type do not require definitions in the derived definition as the defaults
suffice. However, every attribute must specify at least a spelling list, a
subject list, and a documentation list.

Spellings
~~~~~~~~~
All attributes are required to specify a spelling list that denotes the ways in
which the attribute can be spelled. For instance, a single semantic attribute
may have a keyword spelling, as well as a C++11 spelling and a GNU spelling. An
empty spelling list is also permissible and may be useful for attributes which
are created implicitly. The following spellings are accepted:

  ============  ================================================================
  Spelling      Description
  ============  ================================================================
  ``GNU``       Spelled with a GNU-style ``__attribute__((attr))`` syntax and
                placement.
  ``CXX11``     Spelled with a C++-style ``[[attr]]`` syntax with an optional
                vendor-specific namespace.
  ``C2x``       Spelled with a C-style ``[[attr]]`` syntax with an optional
                vendor-specific namespace.
  ``Declspec``  Spelled with a Microsoft-style ``__declspec(attr)`` syntax.
  ``Keyword``   The attribute is spelled as a keyword, and requires custom
                parsing.
  ``GCC``       Specifies two or three spellings: the first is a GNU-style
                spelling, the second is a C++-style spelling with the ``gnu``
                namespace, and the third is an optional C-style spelling with
                the ``gnu`` namespace. Attributes should only specify this
                spelling for attributes supported by GCC.
  ``Clang``     Specifies two or three spellings: the first is a GNU-style
                spelling, the second is a C++-style spelling with the ``clang``
                namespace, and the third is an optional C-style spelling with
                the ``clang`` namespace. By default, a C-style spelling is
                provided.
  ``Pragma``    The attribute is spelled as a ``#pragma``, and requires custom
                processing within the preprocessor. If the attribute is meant to
                be used by Clang, it should set the namespace to ``"clang"``.
                Note that this spelling is not used for declaration attributes.
  ============  ================================================================

Subjects
~~~~~~~~
Attributes appertain to one or more subjects. If the attribute attempts to
attach to a subject that is not in the subject list, a diagnostic is issued
automatically. Whether the diagnostic is a warning or an error depends on how
the attribute's ``SubjectList`` is defined, but the default behavior is to warn.
The diagnostics displayed to the user are automatically determined based on the
subjects in the list, but a custom diagnostic parameter can also be specified in
the ``SubjectList``. The diagnostics generated for subject list violations are
calculated automatically or specified by the subject list itself. If a
previously unused Decl node is added to the ``SubjectList``, the logic used to
automatically determine the diagnostic parameter in `utils/TableGen/ClangAttrEmitter.cpp
<https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/ClangAttrEmitter.cpp>`_
may need to be updated.
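
To illustrate how one semantic attribute can be reached through several of the
spellings in the table above, the sketch below uses the existing ``cold``
attribute, which is assumed here to have a ``GCC`` spelling (a GNU spelling
plus a ``gnu``-namespaced C++ spelling) and a function subject. It assumes a
GCC-compatible compiler in C++11 or later; the function names are made up for
the example.

.. code-block:: c++

  // GNU spelling: __attribute__((attr)).
  __attribute__((cold)) int reportFailureGNU(int Code) { return -Code; }

  // C++11 spelling in the "gnu" vendor namespace: [[gnu::attr]].
  [[gnu::cold]] int reportFailureCXX11(int Code) { return -Code; }

  // Writing the attribute on something outside its subject list, e.g.
  //   __attribute__((cold)) int NotAFunction;
  // is expected to trigger the automatically generated subject diagnostic
  // (a warning by default) instead of being silently accepted.

Both functions carry the same semantic attribute; only the spelling recorded
for it differs.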

By default, all subjects in the ``SubjectList`` must either be a Decl node
defined in ``DeclNodes.td``, or a statement node defined in ``StmtNodes.td``.
However, more complex subjects can be created by creating a ``SubsetSubject``
object. Each such object has a base subject which it appertains to (which must
be a Decl or Stmt node, and not a ``SubsetSubject`` node), and some custom code
which is called when determining whether an attribute appertains to the subject.
For instance, a ``NonBitField`` ``SubsetSubject`` appertains to a ``FieldDecl``,
and tests whether the given ``FieldDecl`` is a bit field, accepting only fields
that are not. When a ``SubsetSubject`` is specified in a ``SubjectList``, a
custom diagnostic parameter must also be provided.

Diagnostic checking for attribute subject lists for declaration and statement
attributes is automated except when ``HasCustomParsing`` is set to ``1``.

Documentation
~~~~~~~~~~~~~
All attributes must have some form of documentation associated with them.
Documentation is table generated on the public web server by a server-side
process that runs daily. Generally, the documentation for an attribute is a
stand-alone definition in `include/clang/Basic/AttrDocs.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/AttrDocs.td>`_
that is named after the attribute being documented.

If the attribute is not for public consumption, or is an implicitly-created
attribute that has no visible spelling, the documentation list can specify the
``InternalOnly`` object. Otherwise, the attribute should have its documentation
added to ``AttrDocs.td``.

Documentation derives from the ``Documentation`` tablegen type. All derived
types must specify a documentation category and the actual documentation itself.
Additionally, it can specify a custom heading for the attribute, though a
default heading will be chosen when possible.

There are four predefined documentation categories: ``DocCatFunction`` for
attributes that appertain to function-like subjects, ``DocCatVariable`` for
attributes that appertain to variable-like subjects, ``DocCatType`` for type
attributes, and ``DocCatStmt`` for statement attributes. A custom documentation
category should be used for groups of attributes with similar functionality.
Custom categories are good for providing overview information for the attributes
grouped under them. For instance, the consumed annotation attributes define a
custom category, ``DocCatConsumed``, that explains what consumed annotations are
at a high level.

Documentation content (whether it is for an attribute or a category) is written
using reStructuredText (RST) syntax.

After writing the documentation for the attribute, it should be locally tested
to ensure that there are no issues generating the documentation on the server.
Local testing requires a fresh build of clang-tblgen. To generate the attribute
documentation, execute the following command::

  clang-tblgen -gen-attr-docs -I /path/to/clang/include /path/to/clang/include/clang/Basic/Attr.td -o /path/to/clang/docs/AttributeReference.rst

When testing locally, *do not* commit changes to ``AttributeReference.rst``.
This file is generated by the server automatically, and any changes made to this
file will be overwritten.

Arguments
~~~~~~~~~
Attributes may optionally specify a list of arguments that can be passed to the
attribute. Attribute arguments specify both the parsed form and the semantic
form of the attribute.
For example, if ``Args`` is
``[StringArgument<"Arg1">, IntArgument<"Arg2">]`` then
``__attribute__((myattribute("Hello", 3)))`` will be a valid use; it requires
two arguments while parsing, and the Attr subclass' constructor for the
semantic attribute will require a string and integer argument.

All arguments have a name and a flag that specifies whether the argument is
optional. The associated C++ type of the argument is determined by the argument
definition type. If the existing argument types are insufficient, new types can
be created, but it requires modifying `utils/TableGen/ClangAttrEmitter.cpp
<https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/ClangAttrEmitter.cpp>`_
to properly support the type.

Other Properties
~~~~~~~~~~~~~~~~
The ``Attr`` definition has other members which control the behavior of the
attribute. Many of them are special-purpose and beyond the scope of this
document; however, a few deserve mention.

If the parsed form of the attribute is more complex, or differs from the
semantic form, the ``HasCustomParsing`` bit can be set to ``1`` for the class,
and the parsing code in `Parser::ParseGNUAttributeArgs()
<https://github.com/llvm/llvm-project/blob/main/clang/lib/Parse/ParseDecl.cpp>`_
can be updated for the special case. Note that this only applies to arguments
with a GNU spelling -- attributes with a ``__declspec`` spelling currently
ignore this flag and are handled by ``Parser::ParseMicrosoftDeclSpec``.

Note that setting this member to ``1`` will opt out of common attribute semantic
handling, requiring extra implementation efforts to ensure the attribute
appertains to the appropriate subject, etc.

If the attribute should not be propagated from a template declaration to an
instantiation of the template, set the ``Clone`` member to ``0``.
By default, all
attributes will be cloned to template instantiations.

Attributes that do not require an AST node should set the ``ASTNode`` field to
``0`` to avoid polluting the AST. Note that anything inheriting from
``TypeAttr`` or ``IgnoredAttr`` automatically does not generate an AST node. All
other attributes generate an AST node by default. The AST node is the semantic
representation of the attribute.

The ``LangOpts`` field specifies a list of language options required by the
attribute. For instance, all of the CUDA-specific attributes specify ``[CUDA]``
for the ``LangOpts`` field, and when the CUDA language option is not enabled, an
"attribute ignored" warning diagnostic is emitted. Since language options are
not table generated nodes, new language options must be created manually and
should specify the spelling used by the ``LangOptions`` class.

Custom accessors can be generated for an attribute based on the spelling list
for that attribute. For instance, if an attribute has two different spellings,
'Foo' and 'Bar', accessors can be created:
``[Accessor<"isFoo", [GNU<"Foo">]>, Accessor<"isBar", [GNU<"Bar">]>]``
These accessors will be generated on the semantic form of the attribute,
accepting no arguments and returning a ``bool``.

Attributes that do not require custom semantic handling should set the
``SemaHandler`` field to ``0``. Note that anything inheriting from
``IgnoredAttr`` automatically does not get a semantic handler. All other
attributes are assumed to use a semantic handler by default. Attributes
without a semantic handler are not given a parsed attribute ``Kind`` enumerator.

"Simple" attributes, that require no custom semantic processing aside from what
is automatically provided, should set the ``SimpleHandler`` field to ``1``.
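
To make the custom accessors described above more concrete, the following
self-contained sketch models what they amount to for the two-spelling example.
It is an illustration under stated assumptions, not the generated code:
``ExampleAttr`` and its ``Spelling`` enumeration are hypothetical, and the real
accessors are emitted by TableGen into the generated attribute class.

.. code-block:: c++

  // Hypothetical stand-in for a generated semantic attribute class.
  class ExampleAttr {
  public:
    // One enumerator per entry in the attribute's spelling list.
    enum Spelling { GNU_Foo = 0, GNU_Bar = 1 };

    explicit ExampleAttr(Spelling S) : SpellingIndex(S) {}

    // Each accessor takes no arguments and returns bool: it reports whether
    // the attribute was written with one of the spellings listed for it.
    bool isFoo() const { return SpellingIndex == GNU_Foo; }
    bool isBar() const { return SpellingIndex == GNU_Bar; }

  private:
    Spelling SpellingIndex; // which spelling the user actually wrote
  };

Code that handles the attribute can then branch on ``isFoo()`` or ``isBar()``
instead of comparing spelling indices by hand.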

Target-specific attributes may share a spelling with other attributes in
different targets. For instance, the ARM and MSP430 targets both have an
attribute spelled ``GNU<"interrupt">``, but with different parsing and semantic
requirements. To support this feature, an attribute inheriting from
``TargetSpecificAttribute`` may specify a ``ParseKind`` field. This field
should be the same value between all attributes sharing a spelling, and
corresponds to the parsed attribute's ``Kind`` enumerator. This allows
attributes to share a parsed attribute kind, but have distinct semantic
attribute classes. For instance, ``ParsedAttr`` is the shared
parsed attribute kind, but ``ARMInterruptAttr`` and ``MSP430InterruptAttr`` are
the semantic attributes generated.

By default, attribute arguments are parsed in an evaluated context. If the
arguments for an attribute should be parsed in an unevaluated context (akin to
the way the argument to a ``sizeof`` expression is parsed), set
``ParseArgumentsAsUnevaluated`` to ``1``.

If additional functionality is desired for the semantic form of the attribute,
the ``AdditionalMembers`` field specifies code to be copied verbatim into the
semantic attribute class object, with ``public`` access.

If two or more attributes cannot be used in combination on the same declaration
or statement, a ``MutualExclusions`` definition can be supplied to automatically
generate diagnostic code. This will disallow the attribute combinations
regardless of spellings used. Additionally, it will diagnose combinations within
the same attribute list, different attribute lists, and redeclarations, as
appropriate.
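
The effect of a ``MutualExclusions`` definition can be pictured with the small
model below. It is only a sketch of the idea: ``ExclusionTable`` and
``diagnoseMutualExclusions`` are hypothetical names, attributes are identified
by plain strings standing in for their semantic names, and the message text is
illustrative rather than Clang's actual diagnostic.

.. code-block:: c++

  #include <algorithm>
  #include <string>
  #include <utility>
  #include <vector>

  // Hypothetical exclusion table: each pair names two semantic attributes
  // that may not appear together, regardless of which spelling was used.
  using ExclusionTable = std::vector<std::pair<std::string, std::string>>;

  // Returns one message per violated pair among the attributes collected
  // for a single declaration (across attribute lists and redeclarations).
  std::vector<std::string>
  diagnoseMutualExclusions(const std::vector<std::string> &Attrs,
                           const ExclusionTable &Table) {
    auto Has = [&](const std::string &Name) {
      return std::find(Attrs.begin(), Attrs.end(), Name) != Attrs.end();
    };
    std::vector<std::string> Diags;
    for (const auto &Pair : Table)
      if (Has(Pair.first) && Has(Pair.second))
        Diags.push_back("'" + Pair.first + "' and '" + Pair.second +
                        "' attributes are not compatible");
    return Diags;
  }

Because the check works on semantic names, it does not matter whether the
conflicting attributes were written with GNU, C++, or other spellings.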

Boilerplate
^^^^^^^^^^^
All semantic processing of declaration attributes happens in `lib/Sema/SemaDeclAttr.cpp
<https://github.com/llvm/llvm-project/blob/main/clang/lib/Sema/SemaDeclAttr.cpp>`_,
and generally starts in the ``ProcessDeclAttribute()`` function. If the
attribute has the ``SimpleHandler`` field set to ``1`` then the function to
process the attribute will be automatically generated, and nothing needs to be
done here. Otherwise, write a new ``handleYourAttr()`` function, and add that to
the switch statement. Please do not implement handling logic directly in the
``case`` for the attribute.

Unless otherwise specified by the attribute definition, common semantic checking
of the parsed attribute is handled automatically. This includes diagnosing
parsed attributes that do not appertain to the given ``Decl`` or ``Stmt``,
ensuring the correct minimum number of arguments are passed, etc.

If the attribute adds additional warnings, define a ``DiagGroup`` in
`include/clang/Basic/DiagnosticGroups.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/DiagnosticGroups.td>`_
named after the attribute's ``Spelling`` with "_"s replaced by "-"s. If there
is only a single diagnostic, it is permissible to use ``InGroup<DiagGroup<"your-attribute">>``
directly in `DiagnosticSemaKinds.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/DiagnosticSemaKinds.td>`_.

All semantic diagnostics generated for your attribute, including automatically-
generated ones (such as subjects and argument counts), should have a
corresponding test case.

Semantic handling
^^^^^^^^^^^^^^^^^
Most attributes are implemented to have some effect on the compiler. For
instance, to modify the way code is generated, or to add extra semantic checks
for an analysis pass, etc.
Having added the attribute definition and conversion
to the semantic representation for the attribute, what remains is to implement
the custom logic requiring use of the attribute.

The ``clang::Decl`` object can be queried for the presence or absence of an
attribute using ``hasAttr<T>()``. To obtain a pointer to the semantic
representation of the attribute, ``getAttr<T>()`` may be used.

The ``clang::AttributedStmt`` object can be queried for the presence or absence
of an attribute by calling ``getAttrs()`` and looping over the list of
attributes.

How to add an expression or statement
-------------------------------------

Expressions and statements are among the most fundamental constructs within a
compiler, because they interact with many different parts of the AST, semantic
analysis, and IR generation. Therefore, adding a new expression or statement
kind into Clang requires some care. The following list details the various
places in Clang where an expression or statement needs to be introduced, along
with patterns to follow to ensure that the new expression or statement works
well across all of the C languages. We focus on expressions, but statements
are similar.

#. Introduce parsing actions into the parser. Recursive-descent parsing is
   mostly self-explanatory, but there are a few things that are worth keeping
   in mind:

   * Keep as much source location information as possible! You'll want it later
     to produce great diagnostics and support Clang's various features that map
     between source code and the AST.
   * Write tests for all of the "bad" parsing cases, to make sure your recovery
     is good. If you have matched delimiters (e.g., parentheses, square
     brackets, etc.), use ``Parser::BalancedDelimiterTracker`` to give nice
     diagnostics when things go wrong.

#. Introduce semantic analysis actions into ``Sema``.
   Semantic analysis should
   always involve two functions: an ``ActOnXXX`` function that will be called
   directly from the parser, and a ``BuildXXX`` function that performs the
   actual semantic analysis and will (eventually!) build the AST node. It's
   fairly common for the ``ActOnXXX`` function to do very little (often just
   some minor translation from the parser's representation to ``Sema``'s
   representation of the same thing), but the separation is still important:
   C++ template instantiation, for example, should always call the ``BuildXXX``
   variant. Several notes on semantic analysis before we get into construction
   of the AST:

   * Your expression probably involves some types and some subexpressions.
     Make sure to fully check that those types, and the types of those
     subexpressions, meet your expectations. Add implicit conversions where
     necessary to make sure that all of the types line up exactly the way you
     want them. Write extensive tests to check that you're getting good
     diagnostics for mistakes and that you can use various forms of
     subexpressions with your expression.
   * When type-checking a type or subexpression, make sure to first check
     whether the type is "dependent" (``Type::isDependentType()``) or whether a
     subexpression is type-dependent (``Expr::isTypeDependent()``). If any of
     these return ``true``, then you're inside a template and you can't do much
     type-checking now. That's normal, and your AST node (when you get there)
     will have to deal with this case. At this point, you can write tests that
     use your expression within templates, but don't try to instantiate the
     templates.
   * For each subexpression, be sure to call ``Sema::CheckPlaceholderExpr()``
     to deal with "weird" expressions that don't behave well as subexpressions.
     Then, determine whether you need to perform lvalue-to-rvalue conversions
     (``Sema::DefaultLvalueConversion``) or the usual unary conversions
     (``Sema::UsualUnaryConversions``), for places where the subexpression is
     producing a value you intend to use.
   * Your ``BuildXXX`` function will probably just return ``ExprError()`` at
     this point, since you don't have an AST. That's perfectly fine, and
     shouldn't impact your testing.

#. Introduce an AST node for your new expression. This starts with declaring
   the node in ``include/Basic/StmtNodes.td`` and creating a new class for your
   expression in the appropriate ``include/AST/Expr*.h`` header. It's best to
   look at the class for a similar expression to get ideas, and there are some
   specific things to watch for:

   * If you need to allocate memory, use the ``ASTContext`` allocator to
     allocate memory. Never use raw ``malloc`` or ``new``, and never hold any
     resources in an AST node, because the destructor of an AST node is never
     called.
   * Make sure that ``getSourceRange()`` covers the exact source range of your
     expression. This is needed for diagnostics and for IDE support.
   * Make sure that ``children()`` visits all of the subexpressions. This is
     important for a number of features (e.g., IDE support, C++ variadic
     templates). If you have sub-types, you'll also need to visit those
     sub-types in ``RecursiveASTVisitor``.
   * Add printing support (``StmtPrinter.cpp``) for your expression.
   * Add profiling support (``StmtProfile.cpp``) for your AST node, noting the
     distinguishing (non-source location) characteristics of an instance of
     your expression. Omitting this step will lead to hard-to-diagnose
     failures regarding matching of template declarations.
   * Add serialization support (``ASTReaderStmt.cpp``, ``ASTWriterStmt.cpp``)
     for your AST node.

#. Teach semantic analysis to build your AST node. At this point, you can wire
   up your ``Sema::BuildXXX`` function to actually create your AST. A few
   things to check at this point:

   * If your expression can construct a new C++ class or return a new
     Objective-C object, be sure to update and then call
     ``Sema::MaybeBindToTemporary`` for your just-created AST node to be sure
     that the object gets properly destructed. An easy way to test this is to
     return a C++ class with a private destructor: semantic analysis should
     flag an error here with the attempt to call the destructor.
   * Inspect the generated AST by printing it using ``clang -cc1 -ast-print``,
     to make sure you're capturing all of the important information about how
     the AST was written.
   * Inspect the generated AST under ``clang -cc1 -ast-dump`` to verify that
     all of the types in the generated AST line up the way you want them.
     Remember that clients of the AST should never have to "think" to
     understand what's going on. For example, all implicit conversions should
     show up explicitly in the AST.
   * Write tests that use your expression as a subexpression of other,
     well-known expressions. Can you call a function using your expression as
     an argument? Can you use the ternary operator?

#. Teach code generation to create IR for your AST node. This step is the first
   (and only) one that requires knowledge of LLVM IR. There are several things
   to keep in mind:

   * Code generation is separated into scalar/aggregate/complex and
     lvalue/rvalue paths, depending on what kind of result your expression
     produces. On occasion, this requires some careful factoring of code to
     avoid duplication.
   * ``CodeGenFunction`` contains functions ``ConvertType`` and
     ``ConvertTypeForMem`` that convert Clang's types (``clang::Type*`` or
     ``clang::QualType``) to LLVM types.
     Use the former for values, and the
     latter for memory locations: test with the C++ "``bool``" type to check
     this. If you find that you are having to use LLVM bitcasts to make the
     subexpressions of your expression have the type that your expression
     expects, STOP! Go fix semantic analysis and the AST so that you don't
     need these bitcasts.
   * The ``CodeGenFunction`` class has a number of helper functions to make
     certain operations easy, such as generating code to produce an lvalue or
     an rvalue, or to initialize a memory location with a given value. Prefer
     to use these functions rather than directly writing loads and stores,
     because these functions take care of some of the tricky details for you
     (e.g., for exceptions).
   * If your expression requires some special behavior in the event of an
     exception, look at the ``push*Cleanup`` functions in ``CodeGenFunction``
     to introduce a cleanup. You shouldn't have to deal with
     exception-handling directly.
   * Testing is extremely important in IR generation. Use ``clang -cc1
     -emit-llvm`` and `FileCheck
     <https://llvm.org/docs/CommandGuide/FileCheck.html>`_ to verify that you're
     generating the right IR.

#. Teach template instantiation how to cope with your AST node, which requires
   some fairly simple code:

   * Make sure that your expression's constructor properly computes the flags
     for type dependence (i.e., the type your expression produces can change
     from one instantiation to the next), value dependence (i.e., the constant
     value your expression produces can change from one instantiation to the
     next), instantiation dependence (i.e., a template parameter occurs
     anywhere in your expression), and whether your expression contains a
     parameter pack (for variadic templates). Often, computing these flags
     just means combining the results from the various types and
     subexpressions.
   * Add ``TransformXXX`` and ``RebuildXXX`` functions to the ``TreeTransform``
     class template in ``Sema``. ``TransformXXX`` should (recursively)
     transform all of the subexpressions and types within your expression,
     using ``getDerived().TransformYYY``. If all of the subexpressions and
     types transform without error, it will then call the ``RebuildXXX``
     function, which will in turn call ``getSema().BuildXXX`` to perform
     semantic analysis and build your expression.
   * To test template instantiation, take those tests you wrote to make sure
     that you were type checking with type-dependent expressions and dependent
     types (from step #2) and instantiate those templates with various types,
     some of which type-check and some that don't, and test the error messages
     in each case.

#. There are some "extras" that make other features work better. It's worth
   handling these extras to give your expression complete integration into
   Clang:

   * Add code completion support for your expression in
     ``SemaCodeComplete.cpp``.
   * If your expression has types in it, or has any "interesting" features
     other than subexpressions, extend libclang's ``CursorVisitor`` to provide
     proper visitation for your expression, enabling various IDE features such
     as syntax highlighting, cross-referencing, and so on. The
     ``c-index-test`` helper program can be used to test these features.
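
The split between ``ActOnXXX``, ``BuildXXX`` and ``TransformXXX`` in the steps
above can be sketched with a deliberately tiny model. Everything here is
hypothetical and greatly simplified: the ``Ty``, ``Expr`` and ``Sema`` types
are invented for the example, the made-up "negate" operator only accepts
integers, and nodes are held in ``std::shared_ptr`` for brevity, whereas real
AST nodes are allocated from the ``ASTContext``.

.. code-block:: c++

  #include <memory>

  // Hypothetical miniature type system and AST; none of this is Clang's API.
  enum class Ty { Int, Pointer, Dependent };

  struct Expr {
    Ty Type;
    std::shared_ptr<Expr> Sub; // null for leaf expressions
    bool isTypeDependent() const { return Type == Ty::Dependent; }
  };
  using ExprResult = std::shared_ptr<Expr>; // null plays the role of ExprError()

  struct Sema {
    int NumErrors = 0;

    // BuildXXX: performs the actual semantic analysis and creates the node.
    ExprResult BuildNegate(ExprResult Sub) {
      if (!Sub)
        return nullptr;
      // Inside a template: defer type checking and stay dependent.
      if (Sub->isTypeDependent())
        return std::make_shared<Expr>(Expr{Ty::Dependent, Sub});
      if (Sub->Type != Ty::Int) { // the type check for this made-up operator
        ++NumErrors;
        return nullptr;
      }
      return std::make_shared<Expr>(Expr{Ty::Int, Sub});
    }

    // ActOnXXX: called from the parser; here it only forwards to BuildXXX.
    ExprResult ActOnNegate(ExprResult Sub) { return BuildNegate(Sub); }
  };

  // TransformXXX: substitute Arg for dependent leaves, then rebuild through
  // BuildXXX so that the deferred type check finally runs.
  ExprResult TransformExpr(Sema &S, const ExprResult &E, Ty Arg) {
    if (!E)
      return nullptr;
    if (!E->Sub) // leaf: replace a dependent type by the template argument
      return std::make_shared<Expr>(
          Expr{E->isTypeDependent() ? Arg : E->Type, nullptr});
    return S.BuildNegate(TransformExpr(S, E->Sub, Arg));
  }

Parsing the template body goes through ``ActOnNegate`` once and produces a
dependent node without errors; each instantiation then calls
``TransformExpr``, which reaches ``BuildNegate`` again and reports an error
only for arguments that fail the type check.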